Abstract
It is well known that a model can generalize even when it completely interpolates the training data, which is known as the benign overfitting. Indeed, several work have theoretically revealed that the minimum-norm interpolator can exhibit the benign overfitting. On the other hand, deep learning models such as two-layer neural networks have been reported to outperform “shallow” learning models such as kernel methods under appropriate model sizes by adaptively learning the basis functions to the data. This mechanism is called feature learning, and it is known empirically to be beneficial even when the model size is large. However, it is generally difficult to show that benign overfitting occurs in learning models with feature learning especially for regression problems. In this study, we then analyze the predictive error of the estimator after one step feature learning in a two-layer linear neural network optimized by gradient descent methods and study the effect of feature learning on benign overfitting. The results show that feature learning reduces bias compared to a one-layer linear regression model without feature learning, especially when the eigenvalues of the covariance of input decay slowly. On the other hand, we clarify that the variance is hardly changed by feature learning. This differs significantly from the results for benign overfitting in the situation without feature learning and indicates the usefulness of feature learning.
Similar content being viewed by others
Avoid common mistakes on your manuscript.
1 Introduction
In a standard learning theory, it is usually recommended not to overfit to the training data because overfitting leads to poor predictive performance. Then, several regularization or variable selection are usually applied to restrict the degrees of freedom of the model.
However, it has been reported for various models experimentally that overparametrized models generalize very well even when no regularization is applied [5, 16, 21]. This phenomenon is called benign overfitting and has been the subject of significant research in recent years. Theoretical analyses that explain benign overfitting are carried out as Bartlett et al. [4, 19], Hastie et al. [10], Xu and Hsu [20] in the one-layer linear regression model, and Li et al. [15], Louart et al. [14], Mei and Montanari [13] in the random feature model of the two-layer nonlinear neural network.
By the way, it has been reported in various situations that “deep” learning models such as two-layer neural networks outperform “shallow” learning models such as kernel methods under appropriate model sizes [11, 18]. This is considered to be because deep learning models can adaptively learn basis functions, which is known as feature learning. On the other hand, deep learning models show good generalization performance in practice without necessarily choosing an appropriate model size. In particular, the model size for deep learning is often larger than the sample size.
Ba et al. [3] gives an analysis of the predictive error of the estimators after one step feature learning of a two-layer neural network, and shows that the model with feature learning generalizes beyond the lower bound of the predictive error of the linear models such as kernel methods, especially when the sample size is larger than the model size. However, the analysis is given in the proportional limit, and finite-dimensional analysis has not been carried out. In another study, Damian et al. [8] analyzed the learning ability of a one step feature learning model in the setting that the output depends only a low dimensional subspace with dimensionality r in the d-dimensional input space. According to Damian et al. [8], the one-step gradient feature learning makes the model generalize even when \(n \asymp d^2 r + d r^p\) while random feature models require larger sample size \(n \gtrsim d^p\) where p is the degrees of polynomial to be estimated. However, the effect of feature learning in a situation where \(n<d\) is not reported. Therefore, it is an important question to study whether a model with feature learning achieves benign overfitting when the sample size is smaller than the input dimension.
In this study, we investigate three different feature learning procedures and their effect to the predictive performance in the interpolation regime where the model is the two-layer linear neural network. Specifically, the three procedures we investigate are (i) training the first layer by one-step vanilla gradient descent, (ii) that by natural gradient descent and (iii) that by linear regression; and on top of that, we obtain the minimum norm interpolator using the second layer. The predictive performances of the three procedures are analyzed in the finite-dimensional overparametrized regime, where the dimension of the input data is larger than the sample size, and then we compare the performances between the three procedures so that we reveal what kind of feature learning method is effective for benign overfitting. As a summary, we provide the following contributions in this paper:
First, we analyze a situation where the step size for the feature learning is small. We give an explicit formula of the difference between the predictive errors with and without feature learning. Then, we characterize when the feature learning can reduce the predictive error. We also confirm that our theoretical evaluation is consistent with numerical experiments. As a consequence, we see that, in “difficult” instances where the true signal is not well aligned to the input, the NGD can reduce the bias while the GD cannot. This indicates the effectiveness of NGD in the interpolation regime.
Second, we evaluate the generalization error when we employ a larger step size for feature learning. Then, we see that the generalization error after feature learning does not deteriorate compared with that of the non-feature learned linear interpolator. We also give a detailed theoretical analysis that characterizes the differences of predictive errors between the three approaches.
In the former, we find that the reduction of predictive errors induced by feature learning becomes larger as the eigenvalues of the covariance of the input decays more slowly, i.e., the regression problem becomes more difficult. Though it is known that when the eigenvalues are polynomially decaying with \(\lambda _i=i^{-(1+\alpha )}~(\alpha >0)\), the optimal minimax rate can be achieved by one-layer linear regression, we show theoretically and experimentally that feature learning significantly reduces the bias while feature learning does not change the variance when \(\alpha <0\).
Amari et al. [1] which studied the predictive error of linear interpolator with preconditioning reported a tradeoff in which the bias is reduced by preconditioning while the variance is increased. However, in this study, we show that only the bias decreases and the variance hardly increases. This significant difference is because we deal with a linear interpolator computed after feature learning while Amari et al. analyzed only a linear interpolator without feature learning. In addition, Ba et al. [3] analyzed the predictive error after one gradient step feature learning, however, they considered the proportional limit where n and d are taken limit to \(\infty \) where their ratio converges to a positive constant \(n \asymp d\). On the other hand, our analysis does not take the proportional limit and is valid in a general finite sample and finite dimensional setting. As a consequence, the reduction of the predictive error can be shown even in a setting of \(d = \omega (n)\).
1.1 Notations
Throughout this paper, for a vector \({\varvec{v}}\in \mathbb {R}^{n}\) and a positive definite symmetric matrix \(\Sigma \in \mathbb {R}^{n \times n}\), we let \(\Vert {\varvec{v}}\Vert ^2_\Sigma :={\varvec{v}}^\top \Sigma {\varvec{v}}\) and \(\Vert {\varvec{v}}\Vert _\Sigma :=\sqrt{\Vert {\varvec{v}}\Vert ^2_\Sigma }\). For a square matrix \(A\in \mathbb {R}^{n\times n}\), its operator norm is denoted by \(\Vert A\Vert _{\textrm{op}}\) and the i-th largest eigenvalue of A, allowing for ties, is denoted by by \(\mu _i(A)\). Also, for a matrix \(B\in \mathbb {R}^{n\times m}\), let \(B_{0:k}\in \mathbb {R}^{n\times k}\) denote the submatrix of B from which the left k columns are taken and \(B_{k:\infty }\in \mathbb {R}^{n\times m-k}\) denotes the submatrix of B consisting of the other columns. Similarly, for \({\varvec{v}}\in \mathbb {R}^n\), let \({\varvec{v}}_{0:k}\in \mathbb {R}^k\) be a vector consisting of the upper k components of \({\varvec{v}}\) and \({\varvec{v}}_{k:\infty }\in \mathbb {R}^{n-k}\) be a vector consisting of the other components. Let \(\mathcal {O}(\cdot )\) and \(o(\cdot )\) denote the Landau’s asymptotic big-O and little-o notations respectively, and when these notations are used for fixed finite demensional matrices and vectors, assume that the asymptotic property holds for each component. In addition, for a sequence of random variables \(X_n\), \(X_n = o_\mathbb {P}(1)\overset{\textrm{def}}{\iff }\) for every \(\varepsilon >0\), \(\lim _{n\rightarrow \infty }P(|X_n|\le \varepsilon )=1\) and \(X_n = o_\mathbb {P}(f(n))\overset{\textrm{def}}{\iff }\) \(X_n/f(n)=o_\mathbb {P}(1)\). Similarly, \(X_n = \mathcal {O}_\mathbb {P}(1)\overset{\textrm{def}}{\iff }\) for every \(\varepsilon >0\), there exist a constant \(K(\varepsilon )\) and an integer \(N(\varepsilon )\) such that if \(n\ge N(\varepsilon )\) then \(P(|X_n|\le K(\varepsilon ))\ge 1-\varepsilon \), and \(X_n = \mathcal {O}_\mathbb {P}(f(n))\overset{\textrm{def}}{\iff }\) \(X_n/f(n)=\mathcal {O}_\mathbb {P}(1)\).
1.2 Related work
Benign overfitting of linear estimator. Recent works provided analyses of linear models under high demensional input [1, 4, 10, 19, 20]. In particular, feature learning in the two-layer linear neural network can be regarded as a kind of preconditioning analyzed in Amari et al. [1]. However this result cannot be applied to this study because it considered only diagonalizable preconditioning. Benign overfitting was also analysed in the random feature model of the two-layer nonlinear neural network [13,14,15].
All of these estimator are linear estimators, that is, they are linear to the output data. However, the estimator we consider in this study is not linear because feature learning in the first layer induced nonlinearity to the output data.
Two-layer neural network model. Two-layer neural network with feature learning is one of the simplest deep learning models. It is known theoretically that appropriately sized deep learning estimators outperform linear estimators for certain classes of functions [11, 18]. However deep learning estimators perform well even when the appropriate model size is not chosen and then some works analyzed such situation [2, 3, 6, 8, 9]. In particular, it has been pointed out that the transition trajectories of the parameters in the learning process are highly dependent on the initial stage of learning [12], and Ba et al. [3] showed that the estimator with one step feature learning outperform linear estimators where sample size is larger than input dimension. On the other hand, we consider the estimators with one step feature learning where input dimension is larger than sample size. In addition, this study performs analysis on finite data while Ba et al. [3] did in proportional limit. The importance of analysis with finite data has been pointed out in detail by Cheng and Montanari [7].
1.3 Paper organization
Section 2 describes the learning model and input–output data setup that is the subject of theoretical analysis in this study, and provides an interpretation of the proposed model. In Sect. 3, we give an analysis of the predictive errors given by these learning methods and also refer to a brief note and discussion of the results. In Sect. 4, we perform numerical experiments related to the main results of Sect. 3 and discussion based on these results. Section 5 again summarizes this study and its results, and mentions future works. The proof of the analyses are performed in Sects. 6 and 7. Section 8 contains numerical experiments on facts not directly related to the analytical results of this study. Section 9 summarizes the perturbation theory of matrices necessary for the analysis.
2 Problem setup and training procedure
The goal of this study is to construct an estimator based on training data and theoretically evaluate the predictive error of this estimator. In Sect. 2.1, we describe the problem setting and assumptions on the underlying distribution in this study, which is consistent with Tsigler and Bartlett [19]. In Sect. 2.2, we explain the two-layer linear network we consider in this study and introduce the feature learning procedure.
2.1 Setting of input and output data
-
1.
Suppose that we observe an i.i.d. sequence of input \(({\varvec{x}}_i)_{i=1}^n \in \mathbb {R}^d\) whose mean is zero and covariance matrix is \(\Sigma \). Without loss of generality, \(\Sigma \) can be transformed into a diagonal matrix with eigenvalues arranged in descending order by an appropriate change of basis, and then we assume that \(\Sigma \) = diag\((\lambda _1, \lambda _2, \dots ,\lambda _d) \) with \(\lambda _1 \ge \lambda _2\ge \cdots \ge \lambda _d\). In addition, we assume that the random variable \(\Sigma ^{-\frac{1}{2}}{\varvec{x}}_i\) (if \(\lambda _{k}\), diag) follows a sub-Gaussian distribution with sub-Gaussian norm \(\sigma _x\).Footnote 1
-
2.
The output data \((y_i)_{i=1}^n\in \mathbb {R}\) are assumed to be observed as the value of a linear function \(f^*({\varvec{x}}_i)={\varvec{x}}_i^\top {\varvec{a}}^*\) with additive observation noise:
$$\begin{aligned} y_i={\varvec{x}}_i^\top {\varvec{a}}^*+\epsilon _i\quad (i=1,\dots ,n). \nonumber \end{aligned}$$Here, \((\epsilon _i)_{i=1}^n\) follow a Gaussian distribution with mean 0 and variance \(\sigma _\epsilon ^2\) independently: \(\epsilon _i \overset{\mathrm {i.i.d.}}{\sim }\mathcal {N}(0,\sigma _\epsilon ^2)\). Furthermore, we assume \(\Vert {\varvec{a}}^*\Vert ^2_\Sigma =1\) without loss of generality.
In this study, we construct an estimator of a linear model \(\hat{f}({\varvec{x}})={\varvec{x}}^\top \hat{{\varvec{a}}}\) from the training data \(({\varvec{x}}_i,y_i)_{i=1}^n\) that satisfy the above conditions and evaluate the predictive error measured by the \(L^2\)-norm: \(\mathbb {E}_{{\varvec{x}}}[(f^*({\varvec{x}})-\hat{f}({\varvec{x}}))^2]\) where the expectation of x is taken over the unseen test data which is independent of the training data. Here, note that the predictive error can be expressed as follows:
2.2 Training procedure
In this study, we consider a fully-connected two-layer linear neural network with N neurons. The input and output of the network are represented as follows:
Next, we describe how to train the parameters \(W,{\varvec{a}}\) of this model.
-
1.
First, we randomly initialized the parameters \(W = W_0\) and \({\varvec{a}} = {\varvec{a}}_0\) as i.i.d. realization of Gaussian random variables: \(\sqrt{N} ~W_{ij}\overset{}{\sim }\mathcal {N}(0,1),~\sqrt{N} ~a_j \overset{}{\sim }\mathcal {N}(0,1)~(i \in [d],~j\in [N])\).
-
2.
Second, the first-layer parameter \(W_0\) is updated by one-step gradient descent to reduce the training error \(\mathcal {L}=\frac{1}{n}\sum _{i=1}^n\left( y_i-f({\varvec{x}}_i)\right) ^2\). More precisely, it is updated as
$$\begin{aligned} W_1 = W_0 - \eta P(\lambda ^{(1)}) \nabla _{W_0}\mathcal {L}.\nonumber \end{aligned}$$Here, \(P(\lambda ^{(1)})\) is a \(d\times d \) matrix that depends on the regularization parameter \(\lambda ^{(1)}\ge 0\), and \(\eta \ge 0\) represents the step size.
-
3.
Third, \({\varvec{a}}_0\) is updated by performing the ridge regression with independent realization of the training data \((\tilde{{\varvec{x}}}_i,\tilde{y}_i)_{i=1}^n\) which have the same distribution as those used to train the first layer:
$$\begin{aligned} {\varvec{a}}_1 = \text{ argmin}_{{\varvec{a}}\in \mathbb {R}^N}\, \left\{ \sum _{i=1}^n(\tilde{y}_i-\tilde{{\varvec{x}}}_i^\top W_1 {\varvec{a}})+\lambda ^{(2)}\Vert {\varvec{a}}\Vert ^2\right\} .\nonumber \end{aligned}$$
Here, we give some remark on the training procedure described above. In the following, we write \(X := ({\varvec{x}}_1,\dots ,{\varvec{x}}_n)^\top ,~\tilde{X} := (\tilde{{\varvec{x}}}_1,\dots ,\tilde{{\varvec{x}}}_n)^\top ,~{\varvec{y}} = (y_1,\dots ,y_n)^\top ,~\tilde{{\varvec{y}}}=(\tilde{y}_1,\dots ,\tilde{y}_n)^\top \). We first note that the matrix \(\nabla _{W_0}\mathcal {L}\) of the update equation of the first-layer is represented as
This shows that the update from \(W_0\) to \(W_1\) is given by a rank-1 matrix. Therefore, we may write
where \(\tilde{a}\) is given by
In this study, we primarily analyze the situation where the number of neurons N is infinitely large while keeping d, n finite. In this case, we can see that \(\tilde{{\varvec{a}}}\overset{a.s.}{\rightarrow }P(\lambda ^{(1)})\frac{1}{n}X^\top {\varvec{y}}~( N\rightarrow \infty )\) because the second term of (4) converges to a zero matrix.
Next, in this study, we analyze three choices of the matrix of \(P(\lambda ^{(1)})\):
Here, \(\Sigma \) is not necessarily a non-singular matrix and \(X^\top X\) is not a non-singular matrix because \(X\in \mathbb {R}^{n\times d}\) and \(n<d\). In this case, if \(\lambda ^{(1)} = 0\), \((\Sigma +\lambda ^{(1)}I_d)^{-1}\) and \(\left( \frac{1}{n}(X^\top X+\lambda ^{(1)}I_d)\right) ^{-1}\) don’t exist. However, even in this case, \(\lim _{\lambda ^{(1)}\rightarrow 0}P_i(\lambda ^{(1)})\frac{1}{n}X^\top \left( {\varvec{y}}-\frac{1}{\sqrt{N}}XW_0{\varvec{a}}\right) ~(i=2,3)\) exists. So, we define \(P_i(0):=\lim _{\lambda ^{(1)\rightarrow 0}}P_i(\lambda ^{(1)})~(i=2,3)\) for convenience.
The following is an interpretation of each \(P(\lambda ^{(1)})\)
-
1.
When \(P = P_1\), the update of W corresponds to just the vanilla gradient descent.
-
2.
When we use \(P = P_2\), then the update of \(W_0\) can be seen as the mixture of the natural gradient descent (NGD) and the gradient descent (GD). Indeed, if \(\lambda ^{(1)} =0\), it is exactly NGD, and as \(\lambda ^{(1)}\) goes to \(\infty \), it approaches to GD. In that sense, \(\lambda ^{(1)}\) is a parameter that controls the mixture proportion of NGD and GD. The difference of NGD and GD is crucial especially when we consider the one-step update. Indeed, by noticing that \(\mathbb {E}_{X,\epsilon }[\frac{1}{n}X^\top X{\varvec{y}}]=\Sigma {\varvec{a}}^*\) since \(\mathbb {E}_\epsilon [{\varvec{y}}]=X{\varvec{a}}^*\), the update direction of NGD \((\lambda ^{(1)} = 0)\) is exactly same as the true signal \({\varvec{a}}^*\), which means that the NGD can efficiently capture the true signal in average. On the other hand, the inverse of the small eigenvalue of \(\Sigma \) is large, so the noise component in \(\frac{1}{n}X^\top X{\varvec{y}}\) is amplified by \(\Sigma ^{-1}\). Therefore \(\lambda ^{(1)}\) can be seen as a regularizer that uniformly terminates small eigenvalue components of \(\Sigma \) at the value of \(\lambda ^{(1)}\) and suppresses the amplification of noise.
-
3.
Third, since
$$\begin{aligned} \text{ argmin}_{\bar{{\varvec{a}}}\in \mathbb {R}^d} \{\Vert X\bar{{\varvec{a}}}-{\varvec{y}}\Vert ^2+\lambda ^{(1)}\Vert \bar{{\varvec{a}}}\Vert ^2\}= & {} \left( \frac{1}{n}(X^\top X+\lambda ^{(1)}I_d)\right) ^{-1}\frac{1}{n}X^\top {\varvec{y}} \nonumber \\= & {} X^\top (XX^\top +\lambda ^{(1)}I_n)^{-1}{\varvec{y}}, \end{aligned}$$(5)we can see that the direction of the update is a ridge regression where \(P(\lambda ^{(1)})=P_3(\lambda ^{(1)})\). The derivation of (5) is described in Sect. 6.
Next, we discuss the optimization of the second layer. This is a one-layer linear ridge regression on the data \(XW_1\) passing through the first-layer. However, as we have already seen, \(W_1\) and X are not independent. In our analysis, we then optimize the second layer \({\varvec{a}}_1\) based on the same size data \(\tilde{X}\) sampled again independently of X and suppress the stochastic behavior of X and \(\tilde{X}\) in two parts. This is an approximate setup for the analysis, but the qualitative behavior is consistent with that of actual models that are learned with unsplit data (Fig. 11). In a situation where data are split, as in (5), \({\varvec{a}}_1\) can be written explicitly with \(W_1\) and \(\tilde{X}\):
Therefore, the estimator of the two-layer linear model in this study is given by \(\hat{{\varvec{a}}}=W_1{\varvec{a}}_1 = W_1 W_1^\top \tilde{X}^\top (\tilde{X}W_1 W_1^\top \tilde{X}^\top +\lambda ^{(2)} I_n)^{-1}\tilde{{\varvec{y}}}\). This estimator \(\hat{{\varvec{a}}}\) can be regarded as a kind of preconditioned estimators analyzed in Amari et al. [1]. In general, preconditioned estimators is given by
with a matrix P. Therefore \(\hat{{\varvec{a}}}\) is a variant of a preconditioned estimator where \(P=W_1 W_1^\top \) in (6). In particular, since \(W_0 W_0^\top \overset{a.s.}{\rightarrow }I_d~( N/d \rightarrow \infty )\), the random feature model is consistent with a one-layer linear regression model when N is sufficiently large (Fig. 9). As we have mentioned above, we consider a setting where N is taken to be \(\infty \) so that \(W_0W_0^\top =I_d\) because our goal is to compare the respective predictive errors of the estimator obtained by the one-step feature learning model and one obtained by the one-layer linear regression model.
3 Main results
In this section, we first give the Bias-Variance decomposition of the predictive error, and then derive upper-bounds on the Bias and Variance for the general step size. We next explicitly compare the predictive errors between a feature learning method and a non-feature learning method in two different regimes of the step size: small step size setting and large step size setting.
The predictive error given in (1) is decomposed into Bias and Variance as
The following theorem enables us to reduce evaluation of the Bias and Variance of the two layer neural network to that of the vanilla (one-layer) ridge regression.
Theorem 1
Denote the right singular vectors of \(\Sigma ^\frac{1}{2}W_1\) by \(\tilde{V}\in \mathbb {R}^{N\times N}\). Here, \(\Lambda := \tilde{V}^\top W_1^\top \Sigma W_1\tilde{V}\) is a diagonal matrix such that \(\Lambda =\text{ diag }(\tilde{\lambda }_1,\dots ,\tilde{\lambda }_N)\) with \(\tilde{\lambda }_1\ge \tilde{\lambda }_2\ge \cdots \ge \tilde{\lambda }_N\). Then, it holds that
Here, let \(W_1^\star \) be any matrix satisfying \(W_1W_1^\star =I_d\). Since \(W_1\) is row full rank, we can easily check that \(W_1^\star \) exists. Indeed, \(W_1^\dagger \) satisfies \(W_1W_1^\dagger =I_d\).
Remark 1
Since \(W_1\in \mathbb {R}^{d\times N}\) and \(N>d\), \(\tilde{\lambda }_{d+1}=\cdots =\tilde{\lambda }_N=0\).
Once the first layer’s parameter is fixed, the predictive error of the two-layer network can be evaluated through the usual analysis on the one-layer network. The following theorem gives upper bounds of the Bias and Variance of the two-layer network through the analysis on the usual ridge regression in the interpolation regime [19]. With this theorem, we can evaluate the effect of feature learning on the predictive error. From \(\mathbb {E}[\tilde{V}^\top W_1^\top {\varvec{x}}{\varvec{x}}^\top W_1\tilde{V}] = \tilde{V}^\top W_1^\top \Sigma W_1\tilde{V}\), the proof is obtained immediately by applying Theorem 1 of Tsigler and Bartlett [19] to Theorem 1.
Theorem 2
Suppose \(\lambda ^{(2)} \ge 0\). There exist constants \(c_x\) and \(C_x\) depending only on \(\sigma _x\) such that, for any \(\delta <1-4e^{-n/c_x^2}\), (1) the condition number of \(XW_1V_{k:\infty }V_{k:\infty }^\top W_1^\top X^\top +\lambda ^{(2)}I_N\) is at most L with probability \(1-\delta \) where L is a constant such that
and (2) for any \(k \le n/c_x,~t \in (1,n/c_x)\) , we have
with probability \(1-\delta -20e^{-t/c_x}\).
Here, from the note on the random feature model in Sect. 2, the evaluation of the Bias and Variance with \(\eta =0\) follows that of a one-layer ridge regression, namely, the following corollary holds.
Corollary 1
Letting \(\eta =0\) under the same conditions as in Theorem 2, there exists \(N_0\in \mathbb {Z}\) such that, for any \(N\ge N_0\), the following inequalities hold:
Remark 2
\(\bar{B}_0\) and \(\bar{V}_0\) given in Corollary 1 match the evaluation of Bias and Variance in Theorem 1 of Tsigler and Bartlett [19].
We then present some additional lemmas on Theorem 2.
Lemma 1
The second term of \(\bar{B}_1\) is represented as follows:
where, \(\tilde{U}\in \mathbb {R}^{d\times d}\) are the left singular vectors of \(\Sigma ^{\frac{1}{2}}W_1\).
Remark 3
The value of \(\bar{B}_1\) does not depend on the choice of \(W_1^*\).
3.1 Error analysis of sufficiently small step feature learning
In this subsection, we give a first-order perturbation of the predictive error by considering a small step size regime: \(\eta 1\). We first note that the step size need to be appropriately reduced as N increases to performe perturbation analysis.
Now, for given \(\tilde{{\varvec{a}}}\), \(W_1 W_1^\top =I_d+\eta ^2\tilde{{\varvec{a}}}\tilde{{\varvec{a}}}^\top +\eta (W_0 {\varvec{a}}\tilde{{\varvec{a}}}^\top +\tilde{{\varvec{a}}}(W_0 {\varvec{a}})^\top )+(W_0W_0^\top -I_d)\) holds from (3). As \(W_0W_0^\top =I_d+\mathcal {O}_\mathbb {P}(\sqrt{d/N})\) and \(\Vert W_0 {\varvec{a}}\Vert =\mathcal {O}_\mathbb {P}(\sqrt{d/N})\), when \(\eta = N^\gamma \,(\gamma \in (-1/2,0))\) and N is sufficiently large, \(W_1 W_1^\top \) is represented as \(W_1 W_1^\top =I_d+\eta ^2\tilde{{\varvec{a}}}\tilde{{\varvec{a}}}^\top +o_\mathbb {P}(\eta ^2)\). The singular values and singular vectors of \(\Sigma ^{\frac{1}{2}}W_1\) can be evaluated from the expansion. Under this assumption, we give the predictive error after feature learning below. Let \(\eta ^2=\varepsilon \).
Theorem 3
Let \(\bar{B}_1\) and \(\bar{V}_1\) be those given in Theorem 2, and let k and \(\lambda ^{(2)}\) satisfy the same assumptions as in Theorem 2. Furthermore, let \(W_1 W_1^\top =I_d+\varepsilon \tilde{{\varvec{a}}}\tilde{{\varvec{a}}}^\top +o_\mathbb {P}(\varepsilon )\) and assume that the eigenvalues of \(\Sigma \) are not degenerate. Then, \(\bar{B}_1\) and \(\bar{V}_1\) are represented as follows:
Here, \({\varvec{a}}^2_i:=({\varvec{a}}_i)^2\) for any vector \({\varvec{a}}\).
Let each term in the right hand side of the bias \(\bar{B}_1\) in Theorem 3 be \(b_1\) to \(b_6\) respectively, that is,
Each term has the following meaning. First, \(b_1\) and \(b_5\) are the same terms as \(\bar{B}_0\). Then, \(b_2\) and \(b_3\) are the terms that mainly increase or decrease \(\bar{B}_1\) from \(\bar{B}_0\) because \(b_4\) and \(b_6\) are zeros in the setting of sparse signal. Next, we give a corollary below to analyze the effect of \(b_4\) and\(~b_6\) in detail.
Corollary 2
Let k satisfy the same assumptions as in Theorem 2 and let \(\tilde{{\varvec{a}}}=P^{-1}X^\top {\varvec{y}}/n\) with a diagonal matrix \(P=\text{ diag }(p_1,\dots ,p_d)\). Then, the following equations hold:
Remember that \(P = \Sigma \) and \(P= I_d\) correspond to the NGD and GD update respectively.
From Corollary 2, we have the following observations. First, \(b_4\) represents the interaction among the first k components of \({\varvec{a}}^*\). In the summand, the terms with smaller \(\lambda _i - \lambda _j\) have greater impact on the whole bias because \(\lambda _i - \lambda _j\) appears in the denominator in each term. Since the first term is always negative, it makes \(\bar{B}_1\) smaller on average, while the second term makes \(\bar{B}_1\) larger on average. \(b_6\) represents the interaction between the first k components of \({\varvec{a}}^*\) and the rest components of \({\varvec{a}}^*\). Because \(\mathbb {E}_{X,\epsilon }[b_6]\) is always positive, \(b_6\) makes \(\bar{B}_1\) larger on average. However, its effect is relatively small because \(\lambda _i - \lambda _j\) in the denominator tends to be large as they are from different components; i.e., i is from the first k components and j is from the rest.
Sparse signal In practice, signals are often sparse under high dimensional inputs, namely, most components of the \({\varvec{a}}^*\) are zero. Then, in the following, we derive a more detailed evaluation of Theorem 3 where \({\varvec{a}}^*\) is completely sparse, that is, \({\varvec{a}}^*_{k^*}=1/\sqrt{\lambda _{k^*}},~{\varvec{a}}^*_i=0~(\forall i\ne k^*)\) for \(k^*\in [d]\). In this case, since \(b_4=b_6=0\), we obtain the following corollaries where each term in the bias is due to \(b_2\) and \(b_3\).
Corollary 3
Let \(\bar{B}_1\) and \(\bar{V}_1\) be given in Theorem 3 and let k and \(\lambda ^{(2)}\) satisfy the same assumptions in Theorem 2. If \({\varvec{a}}^*_i = \lambda _i^{-1/2} ~(i=k^*),~{\varvec{a}}^*_i = 0 ~(i\ne k^*)\), for \(k^*\le k\) and \(\tilde{{\varvec{a}}} = (\Sigma +\lambda ^{(1)}I_d)^{-1}X^\top {\varvec{y}}/n\), the following equations hold:
In particular, if \(\lambda ^{(1)}=0\), we have following evaluations:
Corollary 3 shows how NGD affects the bias through feature learning. On the other hand, the following corollary gives the effect of GD.
Corollary 4
Let \(\bar{B}_1\) and \(\bar{V}_1\) be given in Theorem 3 and let k, and \(\lambda ^{(2)}\) satisfy the same assumptions in Theorem 2. If \({\varvec{a}}^*_i = \lambda _i^{-1/2} ~(i=k^*),~{\varvec{a}}^*_i = 0 ~(i\ne k^*)\), for \(k^*\le k\) and \(\tilde{{\varvec{a}}} = X^\top {\varvec{y}}/n\), the following equations hold:
In the following, we discuss in more detail the effect of feature learning with NGD and with GD.
For \(\tilde{{\varvec{a}}}=\Sigma ^{-1}X^\top {\varvec{y}}/n\), while \(\tilde{{\varvec{a}}}\) is an unbiased estimator of \({\varvec{a}}^*\) (\(\mathbb {E}_{X,{\varvec{\epsilon }}}[\tilde{{\varvec{a}}}]={\varvec{a}}^*\)), its variance is far away from zero. Indeed, \(\mathbb {E}_{X,{\varvec{\epsilon }}}[\Vert \tilde{{\varvec{a}}}-{\varvec{a}}^*\Vert _\Sigma ^2]\) is approximately \(\frac{d(1+\sigma ^2_\epsilon )}{n}\) from (11). Therefore, in the case where d is larger than n, the estimation accuracy of the first-layer is quite poor. However, Corollary 3 suggests that feature learning can improve predictive error even in such a case. In fact, the following corollary holds for the situation where the bias is reduced.
Corollary 5
Let k, and \(\lambda ^{(2)}\) satisfy the same assumptions in Theorem 2 and let \(\lambda _i=i^{-(1+\alpha )}~(-1<\alpha ,~\alpha \ne 0)\). Then, the following holds:
Especially, letting \(k=o(d),~\lambda ^{(2)}=0\), and \(-1<\alpha <0\), we have that
Hence, if \({k^*}^{1+\alpha }>\frac{|\alpha |(1+\sigma ^2_\epsilon )}{n}d^{1+\alpha }\), then it holds that
and moreover \(\bar{B}_1<\bar{B}_0\) holds.
Corollary 5 tells that the threshold of \(k^*\) yielding better generalization by feature learning is \({k^*}^{1+\alpha }=\frac{|\alpha |(1+\sigma ^2_\epsilon )}{n}d^{1+\alpha }\) if \(\alpha <0\). Especially if \(d=o(n^\frac{1}{1+\alpha })\), the improvement of the bias is \({k^*}^{1+\alpha }\). On the other hand, if d is larger than n and the eigenvalue decay rate is fast, feature learning worsens bias though one-layer linear regression may generalize even in the case where \(d=\infty \) and \(n<\infty \) from Bartlett et al. [4]. In this regime, it is shown that ridge regression exhibits good predictive performance.
For \(\tilde{{\varvec{a}}}=X^\top {\varvec{y}}/n\), decrease in the bias \(\lambda _{k^*}-\frac{1+\sigma _\epsilon ^2}{n}\frac{\sum _{i>k}\lambda _i^2}{\lambda ^{(2)}+\sum _{i>k}\lambda _i}<\lambda _{k^*}={k^*}^{-(1+\alpha )}\) in the same case as Corollary 5. Therefore, as \(k^*\rightarrow \infty \), the improvement of bias converges to zero.
Therefore, when \(k^*\) is large and exeeds the threshold and the eigenvalue decay rate is slow, feature learning with NGD improves the bias while feature learning with GD does not, and then NGD is superior to GD.
3.2 Error analysis of sufficiently large step feature learning
In this subsection, we give an evaluation of the predictive error when the step size \(\eta \) is sufficiently large.
Ba et al. [3] shows that, when \(\eta =\mathcal {O}(N)\), one-step gradient feature learning can outperform the kernel method as \(n,d,N\rightarrow \infty \). It is also reported that, if \(\eta =\mathcal {O}(1)\), feature learning does not improve the predictive error, and \(\eta =\omega ({\sqrt{N}})\) is still not enough.
In this study, we analyze the predictive error in the case where N is sufficiently large while n and d are finite. Here, we assume that the step size \(\eta \) is within the range of \(N^\gamma ~(\gamma \in (0,\frac{1}{2}))\) for the following reasons. As we have seen in Sect. 3.1, \(W_1 W_1^\top =I_d+\eta ^2\tilde{{\varvec{a}}}\tilde{{\varvec{a}}}^\top +\eta (W_0 {\varvec{a}}\tilde{{\varvec{a}}}^\top +\tilde{{\varvec{a}}}(W_0 {\varvec{a}})^\top )\) and \(\eta (W_0 {\varvec{a}}\tilde{{\varvec{a}}}^\top +\tilde{{\varvec{a}}}(W_0 {\varvec{a}})^\top )=\mathcal {O}_\mathbb {P}(\eta \sqrt{\frac{d}{N}})\). Therefore, when \(\eta =N^\gamma ~(\gamma \in (0,\frac{1}{2}))\), \(W_1 W_1^\top =I_d+\eta ^2\tilde{{\varvec{a}}}\tilde{{\varvec{a}}}^\top +\mathcal {O}_\mathbb {P}(\sqrt{d}\eta ^{1-1/2\gamma }) \). Then, for large \(\eta \), \(W_1 W_1^\top =I_d+\eta ^2\tilde{{\varvec{a}}}\tilde{{\varvec{a}}}^\top +o_\mathbb {P}(1) \) and following theorem holds.
We let \(\tilde{U}=\lim _{\eta \rightarrow \infty }\tilde{U}(\eta )\) where \(\tilde{U}(\eta )\) is the eigenvectors of \(\Sigma ^\frac{1}{2}W_1W_1^\top \Sigma ^\frac{1}{2}=\Sigma +\eta ^2\Sigma ^\frac{1}{2}\tilde{{\varvec{a}}}\tilde{{\varvec{a}}}^\top \Sigma ^\frac{1}{2}\). Now we can easily check that the limit exists and \(\tilde{U}\) is an orthogonal matrix. Here, note that the first eigenvector of \(\tilde{U}\) is \(\Sigma ^{\frac{1}{2}} \tilde{{\varvec{a}}}\) and the others are vectors orthogonal to \(\Sigma ^{\frac{1}{2}} \tilde{{\varvec{a}}}\).
Theorem 4
Let \(\bar{B}_1\) and \(\bar{V}_1\) be given in Theorem 2 and let k and \(~\lambda ^{(2)}\) satisfy the same assumption as Theorem 2. Then, letting \({\varvec{b}}=\tilde{{\varvec{a}}}-(\tilde{{\varvec{a}}}^\top \Sigma {\varvec{a}}^*){\varvec{a}}^*\) and if \( W_1 W_1^\top =I_d+\eta ^2\tilde{{\varvec{a}}}\tilde{{\varvec{a}}}^\top +o_\mathbb {P}(1)\), the following holds.
As for the variance, since \(\sum _{i>k}\lambda _i\le \sum _{i>k}\tilde{\lambda }_i\le \sum _{i>k}\lambda _i+\lambda _1\) in (13), it holds that
from Theorem 2. It is expected that \(\sum _{i>k}\tilde{\lambda }_i^2\) also approximately equals to \(\sum _{i>k}\lambda _i^2\).
In Theorem 4, if \(\tilde{{\varvec{a}}}\) approximates \({\varvec{a}}^*\), \(\tilde{{\varvec{a}}}^\top \Sigma {\varvec{a}}^*\) approximately equals to 1. Then, \(\Vert {\varvec{b}}\Vert ^2_\Sigma \) matches \(\Vert {\varvec{a}}^*-\tilde{{\varvec{a}}}\Vert ^2_\Sigma \). Therefore, Theorem 4 can be regarded as a formula evaluating the approximation error of the first-layer estimator \(\tilde{{\varvec{a}}}\) with \({\varvec{a}}^*\) by dividing it by \(\tilde{U}\) into the first k components and the rest of them. However it is difficult to explicitly characterize the distribution of \(\tilde{U}\). Hence, we consider two concrete situations in the following so that we explicitly evaluate Theorem 4.
(1) The first one is the case where the distribution of \(\tilde{U}\) is the worst. Since \(\tilde{U}\) is an orthogonal matrix, it holds that \(\left\| \tilde{U}_{1:k}^\top \Sigma ^\frac{1}{2}\frac{{\varvec{b}}}{\tilde{{\varvec{a}}}^\top \Sigma {\varvec{a}}^*}\right\| ^2_{\Sigma _{1:k}^{-2}}\le \frac{\lambda _k^{-2}\Vert {\varvec{b}}\Vert ^2_\Sigma }{\tilde{{\varvec{a}}}^\top \Sigma {\varvec{a}}^*}\). Here, even in the worst case setting, that is, \(\left\| \tilde{U}_{1:k}^\top \Sigma ^\frac{1}{2}\frac{{\varvec{b}}}{\tilde{{\varvec{a}}}^\top \Sigma {\varvec{a}}^*}\right\| ^2_{\Sigma _{1:k}^{-2}}= \frac{\lambda _k^{-2}\Vert {\varvec{b}}\Vert ^2_\Sigma }{\tilde{{\varvec{a}}}^\top \Sigma {\varvec{a}}^*}\), following evaluation holds.
Corollary 6
Let \(\mathscr {B}\) be given in (7). There exists a large constant \(c_x\) that only depends on \(\sigma _x\) such that
-
1.
If there exists a constant \(C_x\) that only depends on \(\sigma _x\) and \(n\lambda _{k+1}\ge C_x\sum _{i>k}\lambda _i\) holds for some \(k<n/c_x\), then by letting \(\lambda ^{(2)}=n\tilde{\lambda }_{k+1}\) in Theorem 4, we have that for any \(t\in (c_x,n/c_x)\)
$$\begin{aligned} \mathscr {B} \le D_x\frac{\Vert {\varvec{b}}\Vert ^2_\Sigma }{\tilde{{\varvec{a}}}^\top \Sigma {\varvec{a}}^*} \nonumber \end{aligned}$$with probability at least \(1-26e^{-t/c_x}\). Here, \(D_x\) is a constant that only depends on \(\sigma _x\).
-
2.
If \(\sum _{i>k}\lambda _i\ge c_xn\lambda _{k+1}\) holds for some \(k<n/c_x\), then by letting \(\lambda ^{(2)}=-\sum _{i>k}\lambda _i+\xi \left( n\lambda _1+\sqrt{n\sum _{i>k}\lambda _i^2}\right) \) for some constant \(\xi >c_x\), we have that for any \(t\in (c_x,n/c_x)\)
$$\begin{aligned} \mathscr {B} \le D_x\left( 1+\frac{\sum _{i>k}\lambda _i^2}{\lambda _k^2}\right) \frac{\Vert {\varvec{b}}\Vert ^2_\Sigma }{\tilde{{\varvec{a}}}^\top \Sigma {\varvec{a}}^*}\nonumber \end{aligned}$$with probability at least \(1-20e^{-t/c_x}\). Especially, if \(\lambda _i=i^{-(1+\alpha )}~(-0.5<\alpha \le 0)\), we have \(\mathscr {B} \le D_x\frac{\Vert {\varvec{b}}\Vert ^2_\Sigma }{\tilde{{\varvec{a}}}^\top \Sigma {\varvec{a}}^*}\).
Corollary 6 shows that \(\bar{B}_1\) is bounded by a constant multiple of the estimation error in the first layer even in the case where \(\left\| \tilde{U}_{1:k}^\top \Sigma ^\frac{1}{2}\frac{{\varvec{b}}}{\tilde{{\varvec{a}}}^\top \Sigma {\varvec{a}}^*}\right\| ^2_{\Sigma _{1:k}^{-2}}= \frac{\lambda _k^{-2}\Vert {\varvec{b}}\Vert ^2_\Sigma }{\tilde{{\varvec{a}}}^\top \Sigma {\varvec{a}}^*}\). Especially, for the learning model with \(P_3(\lambda _1)=\left( \frac{1}{n}(X^\top X+\lambda ^{(1)}I_d)\right) ^{-1}\) in the first-layer update, the predictive error for the entire second layer is at most a constant multiple of the one-layer ridge regression model. Here, \(\left\| \tilde{U}_{1:k}^\top \Sigma ^\frac{1}{2}\frac{{\varvec{b}}}{\tilde{{\varvec{a}}}^\top \Sigma {\varvec{a}}^*}\right\| ^2_{\Sigma _{1:k}^{-2}}= \frac{\lambda _k^{-2}\Vert {\varvec{b}}\Vert ^2_\Sigma }{\tilde{{\varvec{a}}}^\top \Sigma {\varvec{a}}^*}\) holds when \(\tilde{U}_{1:k}^\top \Sigma ^\frac{1}{2}{\varvec{b}}\propto {\varvec{e}}_k\). However, since \(\tilde{U}\) is left singular vectors of the matrix which is \(\Sigma ^\frac{1}{2}W_0\) added a rank-1 matrix and left singular vectors of \(\Sigma ^\frac{1}{2}W_0\) are \(I_d\), \(\left\| \tilde{U}_{1:k}^\top \Sigma ^\frac{1}{2}\frac{{\varvec{b}}}{\tilde{{\varvec{a}}}^\top \Sigma {\varvec{a}}^*}\right\| ^2_{\Sigma _{1:k}^{-2}}= \frac{\lambda _k^{-2}\Vert {\varvec{b}}\Vert ^2_\Sigma }{\tilde{{\varvec{a}}}^\top \Sigma {\varvec{a}}^*}\) does not hold and feature learning decreases the bias experimentally (Fig. 6).
(2) The second one is the case where the true signal is sparse and the estimation error of the first-layer is sufficiently small.
Corollary 7
Let k satisfy the same assumption as Theorem 2 and \(\tilde{U}\) be given in Theorem 4. \(\{{\varvec{e}}_1,\dots ,{\varvec{e}}_d\}\) denotes the standard basis of \(\mathbb {R}^d\) and we assume that the eigenvalues of \(\Sigma \) are not degenerate. In Theorem 4, if \({\varvec{a}}^*=\lambda _{k^*}^{-\frac{1}{2}}{\varvec{e}}_{k^*}~(k^*\le k)\) and \(\frac{\Vert {\varvec{b}}\Vert ^2_\Sigma }{\tilde{{\varvec{a}}}^\top \Sigma {\varvec{a}}^*}\) is sufficiently small, we have that for \(\tilde{U}_{1:\infty }=(\tilde{{\varvec{u}}}_2,\dots ,\tilde{{\varvec{u}}}_d)\)
In this case, it holds that
4 Numerical experiments
In this section, we present numerical experiments and discuss the validity of our theoretical results we provided in Sect. 3.
4.1 Experimental setting
In this subsection, we describe the overall experimental settings in Sect. 4. We assume that \(\Sigma =\text{ diag }(\lambda _1,\dots ,\lambda _d)\) satisfies \(\lambda _i=i^{-(1+\alpha )}~(\alpha >-1)\) (recall that \(\Sigma \) is the covariance matrix of input data \({\varvec{x}}\)). Namely, we suppose that the distribution of eigenvalues polynomially decay and \(\alpha \) determines the decay rate. Here, note that \(\lim _{d\rightarrow \infty }\sum _{i=1}^d\lambda _i<\infty \) when \(\alpha >0\), and \(\lim _{d\rightarrow \infty }\sum _{i=1}^d\lambda _i=\infty \) when \(-1<\alpha \le 0\). Furthermore, we asuume that input data \({\varvec{x}}\) and noise independently follow Gaussian distributions: \({\varvec{x}} \overset{\text {i.i.d.}}{\sim } \mathcal {N}(0,\Sigma ),~\epsilon \overset{\text {i.i.d.}}{\sim } \mathcal {N}(0,\sigma _\epsilon ^2)\).
In Sect. 3, for ease of analysis, we constructed the theory in a setting where the data were split into sampled data \((X,{\varvec{y}})\) and resampled data \((\tilde{X},\tilde{{\varvec{y}}})\). In this section, however, experiments are conducted in a setting where the data are not split, which is closer to the real setting. To be more precisely, we compute \(W_1\) from data \((X,{\varvec{y}})\) according to (2) and then compute \(\hat{{\varvec{a}}}_{\textrm{FL}},~\hat{{\varvec{a}}}_{\textrm{RF}},\) and \(\hat{{\varvec{a}}}_{\textrm{LR}}\) from the same data in the following way:
Here, \(\hat{{\varvec{a}}}_{\textrm{FL}},~\hat{{\varvec{a}}}_{\textrm{RF}}\) and \(\hat{{\varvec{a}}}_{\textrm{LR}}\) represent the ridge estimators corresponding to the one-step feature learning method, the random feature model and the vanilla linear regression. In Sect. 8, we present some comparisons of the experimental results with and without splitting. The qualitative behavior of the results is largely the same, throughout the experiment, regardless of the splitting setting.
We approximate the bias and variance of the predictive error given by these estimators with numerical calculation. The way how to calculate them is as follows. For \(\textrm{ES}\in \{\textrm{FL},\textrm{RF},\textrm{LR}\}\), we calculate \(\hat{\mathscr {B}}_{\textrm{ES}}\) and \(\hat{\mathscr {V}}_{\textrm{ES}}\) M times over M independent realizations of the training data. For the i-th realization of \(X_i\) and \({\varvec{\epsilon }}_i\) of the training data, we calculate the bias \(\hat{\mathscr {B}}_{\textrm{ES}}^{(i)}\) and variance \(\hat{\mathscr {V}}^{(i)}_{\textrm{ES}}~(i\in [M])\) as
where
Then, we calculate their average \(\frac{1}{M}\sum _{i=1}^M\hat{\mathscr {B}}^{(i)}_{\textrm{ES}}\) and \(\frac{1}{M}\sum _{i=1}^M\hat{\mathscr {V}}^{(i)}_{\textrm{ES}}\) as estimates of their true expectation \(\mathbb {E}_X[\mathscr {B}_{\textrm{ES}}]\) and \(\mathbb {E}_X[\mathscr {V}_{\textrm{ES}}]\) respectively.
In the following, unless otherwise mentioned, each parameter is set as follows: Sample size \(n=100\), dimension of input data \(d=200\), number of neurons in the middle layer \(N=1000,~\alpha =-0.5,~k^* = 100,~\sigma _\epsilon ^2=0.01\) and \(\lambda ^{(1)}=\lambda ^{(2)}=0.\)
4.2 Experiments for small step size
In this subsection, we perform experiments for the small step size setting corresponding to Sect. 3.1. Although the smaller the step size the more accurate our analysis is, we employ rather large step size \(\eta =1\) because the improvement of the predictive error is hidden behind the numerical errors. Here, we let \(M=1000\).
We observe the relationship between each parameter and the bias where \({\varvec{a}}^*_{k^*}=1/\sqrt{\lambda _{k^*}},~{\varvec{a}}^*_i=0~(\forall i\ne k^*)\) for \(k^*\in [d]\). Here, we calculate \(\frac{\mathscr {B}_{\textrm{FL}}-\mathscr {B}_{\textrm{LR}}}{\mathscr {B}_{\textrm{LR}}}\) and \(\frac{\mathscr {B}_{\textrm{FL}}-\mathscr {B}_{\textrm{RF}}}{\mathscr {B}_{\textrm{RF}}}\) to compare the bias given by GD and NGD feature learning methods with those given by linear regression model or random feature model. According to Corollaries 3 and 4, the change in the bias due to feature learning consists of the first term which makes the bias smaller and the second term which makes it larger.
In Fig. 1, we see that feature learning decreases the relative bias well when \(\tilde{{\varvec{a}}}\) is aligned on \({\varvec{a}}^*\) well. In Figs. 2 and 3, we see that when d and \(\sigma _\epsilon ^2\) is increased, the variance \(\mathbb {E}_{X,{\varvec{\epsilon }}}[\Vert \tilde{{\varvec{a}}}-{\varvec{a}}^*\Vert ^2]\) becomes larger and the relative bias is increased especially due to feature learning with NGD. In Fig. 4, we see that proper regularization can reduce the bias further.
Figure 1 shows the relationship between \(k^*\) and the bias for some specific \(\alpha \).
When \(k^*\) is small, the relative bias is increased due to feature learning with NGD. Especially, as \(\alpha \) gets larger, the relative bias gets larger except for \(k^*=1\) in (d). This is because, when \(k^*\) is small, the first term of Corollary 3 which reduces the bias is small and the second term which increases it is large. In contrast, the relative bias is reduced due to feature learning with GD where \(k^*\) is small. Especially, the reduction is large when \(\alpha \) is large. This is because, when \(k^*\) is small, \({\varvec{a^*}}\) is aligned on input and \(\tilde{{\varvec{a}}}\) in (4) with GD estimates the signal well while \(\tilde{{\varvec{a}}}\) with NGD does not due to the large variance \(\mathbb {E}_{X,{\varvec{\epsilon }}}[\Vert \tilde{{\varvec{a}}}-{\varvec{a}}^*\Vert ^2]\). Here, the comparisons with linear regression model behave differently from that with random feature model where feature learning is GD. This is simply because the difference between \(\hat{{\varvec{a}}}_{\textrm{RF}}\) and \(\hat{{\varvec{a}}}_{\textrm{LR}}\) is huge for finite N/d where \(\alpha \) is large or \(k^*\) is small (Fig. 10).
We next see that the relative bias for NGD feature learning decreases as \(k^*\) is increased because the decrease depending on \(k^*\) is \(\frac{1}{\lambda _{k^*}} = {k^*}^{(1+\alpha )}\). Moreover, the rate of decreasing with increasing \(k^*\) is fast when \(\alpha \) is large. In general, regression is difficult when \(k^*\) is large because the principle component of \({\varvec{x}}\) is different from the direction of \({\varvec{a}}^*\). This means one step feature learning with NGD is beneficial for difficult instances. On the contrary, as \(k^*\) gets larger, the decrease in the relative bias due to feature learning with GD gets small because the decrease depending on \(k^*\) is \({k^*}^{-(1+\alpha )}\). Furthermore, for larger \(\alpha \), the decrease in bias approaches zero more quickly with increasing \(k^*\). Hence, the benefit of one step feature learning with GD is approximately zero for difficult instances though that is large for easy instances.
Figure 2 shows how the bias changes against different d when \(k^*=100\) is fixed.
In both (a) and (b), the bias with NGD increases as d is increased. This is because the second term of Corollary 3 is monotonically increases with respect to d; indeed, we see that \(\frac{d-k}{n}\frac{1}{\sum _{i>k}\lambda _i} = \mathcal {O}(\sqrt{d})\) for \(\alpha =-0.5\). This matches the empirical observation in the graph. In contrast, the bias with GD decreases slowly because of the second term of Corollary 4. In fact, the term \(\frac{\sum _{i>k}\lambda _i^2}{\lambda ^{(2)}+\sum _{i>k}\lambda _i}=\mathcal {O}(\log d/\sqrt{d})\) for \(\alpha =-0.5\) and it decreases slowly as d gets larger. The second term of Corollaries 3 and 4 intuitively means that the bias is increased when \(\mathbb {E}_{X,{\varvec{\epsilon }}}[\Vert \tilde{{\varvec{a}}}-{\varvec{a}}^*\Vert _\Sigma ^2]\) is large. This gets larger as d gets larger where \(\tilde{{\varvec{a}}}\) consists of NGD feature learning while this gets smaller as ratio \(k^*/d\) gets smaller where \(\tilde{{\varvec{a}}}\) consists of GD feature learning.
Figure 3 shows how the bias changes against \(\sigma _\epsilon ^2\). Here, let \(M =10000\) in order to suppress bias fluctuation when \(\sigma _\epsilon ^2\) is increased.
The relative bias with NGD feature learning is linearly increasing with respect to \(\sigma ^2_\epsilon \). This is because \(\frac{d-k}{n}\frac{1}{\sum _{i>k}\lambda _i}\) which is the coefficient of \(\sigma ^2_\epsilon \) is positive. In contrast, the relative bias with GD feature learning is flat against increasing \(\sigma ^2_\epsilon \). This is because \(\frac{1}{n}\frac{\sum _{i>k}\lambda _i^2}{\lambda ^{(2)}+\sum _{i>k}\lambda _i}\) which is also coefficient of \(\sigma ^2_\epsilon \) is almost zero.
Figure 4a, b show how the bias changes against positive \(\lambda ^{(1)}\) and \(\lambda ^{(2)}\) respectively. Here, as \(\lambda ^{(1)}\) is increased, the step size is effectively smaller and the estimator of the feature learning model approaches that of a random feature model. Therefore, let the step size be \(\eta (1+\lambda ^{(1)})\) to compensate for this.
Note that \(\lambda ^{(1)}=0\) corresponds to a situation where the feature learning is performed by NGD, and \(\lambda ^{(1)} = \infty \) corresponds to that where GD is used. In Fig. 4a, we set \(k^*=60\) where the relative bias with NGD and that with GD are equal, and indeed the values of the bias at \(\lambda ^{(1)} = 0\) and \(\lambda ^{(1)} = 9\) are approximately equal. We can also see that feature learning decreases the relative bias the most where \(\lambda ^{(1)}=0.2\). This indicates that one step feature learning to the intermediate gradient direction between NGD and GD reduces the bias better than just NGD or GD. This can be explained as follows. When feature learning is performed with NGD, \(\mathbb {E}_{X,{\varvec{\epsilon }}}[\tilde{{\varvec{a}}}]={\varvec{a}}^*\) and the first term of Corollary 3 which decreases the bias is large while \(\mathbb {E}_{X,{\varvec{\epsilon }}}[\Vert \tilde{{\varvec{a}}}-{\varvec{a}}^*\Vert ^2]\) is large and the second term of Corollary 3 which increases the bias is also large. In contrast, when feature learning is performed with GD, \(\Vert \mathbb {E}_{X,{\varvec{\epsilon }}}[\tilde{{\varvec{a}}}]-{\varvec{a}}^*\Vert ^2\) is large and the first term of Corollary 3 which decreases the bias is small while \(\mathbb {E}_{X,{\varvec{\epsilon }}}[\Vert \tilde{{\varvec{a}}}-{\varvec{a}}^*\Vert ^2]\) is small and the second term of Corollary 3 which increases the bias is also small. Therefore, by setting a value of \(\lambda ^{(1)}\) appropriately, feature learning between NGD and GD balances \(\Vert \mathbb {E}_{X,{\varvec{\epsilon }}}[\tilde{{\varvec{a}}}]-{\varvec{a}}^*\Vert ^2\) and \(\mathbb {E}_{X,{\varvec{\epsilon }}}[\Vert \tilde{{\varvec{a}}}-{\varvec{a}}^*\Vert ^2]\) and then reduce the bias better.
Next, as for (b), the relative bias with NGD decreases with increasing \(\lambda ^{(2)}\). This is because \(\frac{1}{\lambda ^{(2)}+\sum _{i>k}\lambda _i}\) which is the second term of Corollary 3 is decreasing with increasing \(\lambda ^{(2)}\). On the other hand, the gleen plot which is the relative bias with GD hardly decreases. This is because \(\frac{\sum _{i>k}\lambda _i^2}{\lambda ^{(2)}+\sum _{i>k}\lambda _i}\) which increases the bias is already small even where \(\lambda ^{(2)}=0\). Here, the reason why the blue plot decreases is simply that the bias with linear regression model is larger than that with random feature model when \(\lambda ^{(2)}\) is small.
We will now discuss the variance of Corollaries 3 and 4. In Fig. 5, we plot \(\frac{\mathscr {V}_{\textrm{FL}}-\mathscr {V}_{\textrm{LR}}}{\mathscr {V}_{\textrm{LR}}}\) and \(\frac{\mathscr {V}_{\textrm{FL}}-\mathscr {V}_{\textrm{RF}}}{\mathscr {V}_{\textrm{RF}}}\) for several \(\alpha \).
(a) shows that the variance of estimators with feature learning is almost the same as that with random features (i.e., no feature learning). Though it appears that the variance of feature learning models is worse than that of the linear regression in (b), this is simply because the variance of the random feature model is larger than that of the linear regression model. The fact that feature learning makes the variance almost no worse can be explained as follows. The variance given by linear regression is mostly determined by the decay rate of the eigenvalues of the covariance matrix \(\mathbb {E}[{\varvec{x}}{\varvec{x}}^T]\). Here, in feature learning models, the decay rates of the covariance matrix of the data passing through the first layer \(\mathbb {E}[{\varvec{x}}W_1W_1^\top {\varvec{x}}^\top ]\) is almost the same as that of \(\mathbb {E}[{\varvec{x}}{\varvec{x}}^T]\) because \(W_1\) is given by adding just a rank-1 matrix to \(W_0\) which approximately satisfies \(W_0W_0^\top = I\). Therefore, the variance is not changed much by one step feature learning.
4.3 Experiments for large step size
In this subsection, we perform experiments for the large step size setting which corresponds to our theoretical analysis in Sect. 3.2. In updating first layer, we write the feature learning model as \(\hat{{\varvec{a}}}_{\mathrm {FL_{GD}}}\), \(\hat{{\varvec{a}}}_{\mathrm {FL_{NGD}}}\) and \(\hat{{\varvec{a}}}_{\mathrm {FL_{WLR}}}\) corresponding to \(W_1 = W_0 + \eta P_1(\lambda ^{(1)}) \nabla _{W_0}\mathcal {L}\), \(W_1 = W_0 + \eta P_2(\lambda ^{(1)}) \nabla _{W_0}\mathcal {L}\) and \(W_1 = W_0 + \eta P_3(\lambda ^{(1)}) \nabla _{W_0}\mathcal {L}\) respectively.
Figure 6 depicts the bias \(\mathscr {B}_{\textrm{ES}}\) corresponding to each \(\hat{{\varvec{a}}}_\textrm{ES}\) for \(\textrm{ES}\in \{\mathrm {FL_{GD}},~\mathrm {FL_{NGD}}, ~\mathrm {FL_{WLR}},~\textrm{RF},~\textrm{LR}\}\) against the step size \(\eta \) where \({\varvec{a}}^*_{k^*}=1/\sqrt{\lambda _{k^*}},~{\varvec{a}}^*_i=0~(\forall i\ne k^*)\).
We observe that \(\mathscr {B}_{\mathrm {FL_{WLR}}}\) and \(\mathscr {B}_{\mathrm {FL_{GD}}}\) decrease as \(\eta \) gets larger while \(\mathscr {B}_{\mathrm {FL_{NGD}}}\) hardly changes. This can be explained as follows. When \({\varvec{a}}^*\) is sparse and \(\Vert {\varvec{b}}\Vert ^2_\Sigma \) is tiny, from Corollary 7, the bias of the upper k components is decreased to less than the estimation error in the first layer while that of the rest of components are the same. Therefore, if the weights of the difference \(\tilde{{\varvec{a}}}-{\varvec{a}}^*\) are concentrated in the upper components, the bias can be further reduced by increasing \(\eta \). In contrast, if the weights are distributed more widely after k, increasing \(\eta \) does not reduce the bias much. Then, from the proof of Corollary 3, the error between \(\tilde{{\varvec{a}}}\) updated by NGD and \({\varvec{a}}^*\) is isotropically distributed throughout the whole space. Hence, since \(k=o(d)\), the weights of \(\tilde{{\varvec{a}}}-{\varvec{a}}^*\) are mostly distributed over the components after k and \(\mathscr {B}_{\mathrm {FL_{NGD}}}\) does not decrease even if \(\eta \) is increased. On the other hand, the weights of the error between \(\tilde{{\varvec{a}}}\) updated by GD and WLR and \({\varvec{a}}^*\) are distributed to the upper k components and the components after k. Thus, \(\mathscr {B}_{\mathrm {FL_{WLR}}}\) and \(\mathscr {B}_{\mathrm {FL_{GD}}}\) decrease with increasing \(\eta \).
Next, we discuss the effect of the step size \(\eta \) to the variance. Figure 7 shows that the variance of each estimator as a function of the step size \(\eta \).
From Fig. 7 , we can see that the variance of any feature learning method is almost the same as that of the random feature model. This is in accordance with the note of Theorem 4 that the eigenvalues of \(\Sigma ^\frac{1}{2}W_1W_1^\top \Sigma \) are almost the same as \(\Sigma ^\frac{1}{2}W_0W_0^\top \Sigma \) even though \(W_0\) is updated with a rank-1 matrix to \(W_1\).
Finally, we show the predictive error of each estimator as a function of the step size \(\eta \) in Fig. 8. Note that the predictive error is the sum of the bias and variance \(\mathscr {B}_{\textrm{ES}}+\mathscr {V}_{\textrm{ES}}\), which can be obtained as the sum of the graphs in Figures 6 and 7.
Figure 8 shows that the estimator of the feature learning model outperforms that of the one-layer linear regression model not only in terms of bias but also in terms of overall predictive error, especially with NGD feature learning. Here, in this experiment, the variance is about one order of magnitude smaller than the bias, and thus the behavior of the predictive error in Fig. 8 is almost dominated by the bias. However, if we multiply \(\sigma ^2_\epsilon \) by 10 and let the bias and variance to the same magnitude, the behavior of the predictive error will still follow the bias approximately. This is because, from Figs. 6 and 7, the difference of bias induced by feature learning is about two orders of magnitude larger than that of the variances.
5 Conclusion and future work
Summary We analyzed the effect of one-step feature learning for the two-layer linear neural network to its predictive error in a high dimensional regression setting and gave detailed evaluations of the bias and variance in two step size regimes (small step size and large step size). In particular, for the small step size, we gave the comparison between the predictive error of the estimator obtained by feature learning and that given by one-layer linear regression in a concrete form, and confirmed the correspondence with numerical experiments. Theoretical and experimental results showed that feature learning is more effective when the decay rate is slow \(\alpha <0\) in both cases.
Future Work In this study, we gave the analysis of a one step feature learning model for the case where the number of neurons is sufficiently large, but the analysis with a finite width network is important because a neural network with a finite number of neurons is used in practice. In addition, this study analyzed a linear neural network without a nonlinear activation function in the intermediate layer, so the estimator of this model has serious limitations as it can only represent linear functions on input data. Therefore, the analysis of feature learning models including a nonlinear activation function is an extremely important issue.
Data availability
The authors declare that the data supporting the findings of this study are available within this paper.
Notes
A random variable X is sub-gaussian if it has a finite sub-gaussian norm \(\Vert X\Vert _{\psi _2}:=\) \(\inf ~\{t>0|\mathbb {E}[\exp (X^2/t^2)]\le 2\}\).
References
Amari, S., Ba, J., Grosse, R.B., Li, X., Nitanda, A., Suzuki, T., Wu, D., Xu, J.: When does preconditioning help or hurt generalization? In: International Conference on Learning Representations (2021). https://openreview.net/forum?id=S724o4_WB3
Azulay, S., Moroshko, E., Nacson, M.S., Woodworth, B.E., Srebro, N., Globerson, A., Soudry, D.: On the implicit bias of initialization shape: beyond infinitesimal mirror descent. In: International Conference on Machine Learning, pp. 468–477. PMLR (2021)
Ba, J., Erdogdu, M.A., Suzuki, T., Wang, Z., Wu, D., Yang, G.: High-dimensional asymptotics of feature learning: how one gradient step improves the representation. In: Advances in Neural Information Processing Systems (2022). https://openreview.net/forum?id=akddwRG6EGi
Bartlett, P.L., Long, P.M., Lugosi, G., Tsigler, A.: Benign overfitting in linear regression. Proc. Natl. Acad. Sci. 117(48), 30063–30070 (2020)
Belkin, M., Hsu, D., Ma, S., Mandal, S.: Reconciling modern machine-learning practice and the classical bias-variance trade-off. Proc. Natl. Acad. Sci. 116(32), 15849–15854 (2019)
Chatterji, N.S., Long, P.M., Bartlett, P.L.: The interplay between implicit bias and benign overfitting in two-layer linear networks. J. Mach. Learn. Res. 23(263), 1–48 (2022)
Cheng, C., Montanari, A.: Dimension free ridge regression (2022). arXiv preprint arXiv:2210.08571
Damian, A., Lee, J., Soltanolkotabi, M.: Neural networks can learn representations with gradient descent. In: Conference on Learning Theory, pp. 5413–5452. PMLR (2022)
Frei, S., Chatterji, N.S., Bartlett, P.: Benign overfitting without linearity: neural network classifiers trained by gradient descent for noisy linear data. In: Conference on Learning Theory, pp. 2668–2703. PMLR (2022)
Hastie, T., Montanari, A., Rosset, S., Tibshirani, R.J.: Surprises in high-dimensional ridgeless least squares interpolation. Ann. Stat. 50(2), 949–986 (2022)
Imaizumi, M., Fukumizu, K.: Deep neural networks learn non-smooth functions effectively. In: International Conference on Artificial Intelligence and Statistics, pp. 869–878. PMLR (2019)
Jastrzebski, S., Szymczak, M., Fort, S., Arpit, D., Tabor, J., Cho, K., Geras, K.: The break-even point on optimization trajectories of deep neural networks. In: International Conference on Learning Representations (2020). https://openreview.net/forum?id=r1g87C4KwB
Li, Z., Zhou, Z.-H., Gretton, A.: Towards an understanding of benign overfitting in neural networks (2021). arXiv preprint arXiv:2106.03212
Louart, C., Liao, Z., Couillet, R.: A random matrix approach to neural networks. Ann. Appl. Probab. 28(2), 1190–1248 (2018)
Mei, S., Montanari, A.: The generalization error of random features regression: precise asymptotics and the double descent curve. Commun. Pure Appl. Math. 75(4), 667–766 (2022)
Neyshabur, B., Li, Z., Bhojanapalli, S., LeCun, Y., Srebro, N.: The role of over-parametrization in generalization of neural networks. In: International Conference on Learning Representations (2019). https://openreview.net/forum?id=BygfghAcYX
Sun, D., Sun, J.: Strong semismoothness of eigenvalues of symmetric matrices and its application to inverse eigenvalue problems. SIAM J. Numer. Anal. 40(6), 2352–2367 (2002)
Suzuki, T.: Adaptivity of deep ReLU network for learning in Besov and mixed smooth Besov spaces: optimal rate and curse of dimensionality. In: International Conference on Learning Representations (2019). https://openreview.net/forum?id=H1ebTsActm
Tsigler, A., Bartlett, P.L.: Benign overfitting in ridge regression (2020). arXiv preprint arXiv:2009.14286
Xu, J., Hsu, D.J.: On the number of variables to use in principal component regression. In: Advances in Neural Information Processing Systems, vol. 32, pp. 5095–5104 (2019)
Zhang, C., Bengio, S., Hardt, M., Recht, B., Vinyals, O.: Understanding deep learning (still) requires rethinking generalization. Commun. ACM 64(3), 107–115 (2021)
Acknowledgements
HN was partially supported by JST CREST (JPMJCR2115). TS was partially supported by JSPS KAKENHI (20H00576) and JST CREST (JPMJCR2015).
Funding
Open Access funding provided by The University of Tokyo.
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of interest
The corresponding author states that there is no Conflict of interest.
Additional information
Communicated by Noboru Murata.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Proof of Sect. 2
1.1 Proof of (5)
Since
the first equation holds. Here, const. denotes a term independent of \({\varvec{a}}\).
Next, we prove the second equation. This can be directly derived from Appendix A of Tsigler and Bartlett [19]. Applying Sherman-Morrison formula yields
Therefore,
Proof of Sect. 3
1.1 Proof of Theorem 1
As for the Bias, we have the statement from
Similarly, one can easily obtain the variance equation.
1.2 Proof of Lemma 1
From singular value decomposition, note that \(\Sigma ^{\frac{1}{2}}W_1 = \tilde{U} \begin{pmatrix}\tilde{\Sigma }^{\frac{1}{2}}&O_{d,N-d}\end{pmatrix} \tilde{V}^\top \). Here, \(\tilde{\Sigma }\) is a diagonal matrix of the eigenvalues of \(\Sigma ^{\frac{1}{2}}W_1 W_1^\top \Sigma ^{\frac{1}{2}}\) in increasing order. From this,
holds. Hence, we have the result from
1.3 Proof of Lemma 3
From Corollary 1, it is clear that the second term of Theorem 2 does not depend on \(W_1^\star \). As for the first term, note that \(\Sigma _{0:k}\) is not singular, k columns of \(V_{0:k}\) span Im\((W_1^\top )\subset \mathbb {R}^{N}\). Here, the set of all \(W_1^\star \) is given as \( \{W_1^\dagger + A\in \mathbb {R}^{N\times d}~|\) each column of A is in Ker\((W_1)\}\). Because \(\mathbb {R}^N=\text{ Im }(W_1^\top )\oplus \text{ Ker }(W_1)\), \(V_{0:k}^\top A=O_{k,d}\) holds and the value of \(V_{0:k}^\top W_1^\star \) is constant for any \(W_1^\star \).
1.4 Proof of Theorem 3
In this proof, we usually omit \(o(\varepsilon )\).
Let \(\Sigma ^{\frac{1}{2}}W_0 = U \begin{pmatrix}{\Sigma }^{\frac{1}{2}}&O_{d,N-d}\end{pmatrix}V^\top \) and \(\Sigma ^{\frac{1}{2}}W_1 = \tilde{U} \begin{pmatrix}\tilde{\Sigma }^{\frac{1}{2}}&O_{d,N-d}\end{pmatrix}\tilde{V}^\top \). Here, from \(W_0 W_0^\top = I_d\), note that \(U=I_d\) and \(V_{0:d} = W_0^\top \). Furthermore, letting \(\tilde{\Sigma } = \text{ diag }(\tilde{\lambda }_1,\dots ,\tilde{\lambda }_d)\) does not contradict the notation of Theorem 1.
Now, because \(\Sigma ^{\frac{1}{2}}W_1 W_1^\top \Sigma ^{\frac{1}{2}} = \Sigma + \varepsilon \Sigma ^{\frac{1}{2}}\tilde{{\varvec{a}}}\tilde{{\varvec{a}}}^\top \Sigma ^\frac{1}{2}\) holds and the eigenvalues of \(\Sigma \) are not degenerate,
holds from Theorem 6. In the following, let \(\tilde{U} = U + \delta U\) and \(\tilde{\Sigma } = \Sigma + \delta \Sigma \). Here, note that \(\delta U\) is an alternating matrix.
Then, we calculates perturbation expansion of each term of \(\bar{B}_1\) and \(\bar{V}_1\) by using above equations. First, as for \(\Vert \tilde{V}_{0:k}^\top W_1^\star {\varvec{a}}^*\Vert ^2_{\Lambda _{0:k}^{-1}}\), because
we can let \(W_1^\star =(\Sigma ^\frac{1}{2}W_1)^\dagger \Sigma ^\frac{1}{2}\). Hence,
Thus,
holds. As for the third term of (8),
holds. Here, when \(\tilde{{\varvec{a}}}\) almost approximates \({\varvec{a}}^*\) and \(\tilde{{\varvec{a}}}_i{\varvec{a}}^*_i>0,\tilde{{\varvec{a}}}_j{\varvec{a}}^*_j>0\) holds for any i, j, the former term is positive and the latter term is negative. Furthermore, when \({\varvec{a}}^*\) is sparse, the third term is zero because diagonal components of \(\Sigma ^{-\frac{1}{2}}\delta U \Sigma ^\frac{1}{2}\) are zero.
Next, as for \(\left( \frac{\lambda ^{(2)}+\sum _{i>k}\tilde{\lambda }_i}{n}\right) ^2\),
holds.
From the above, the first term of \(\bar{B}_1\) is represented as follows:
Next, we analyze the second term of \(\bar{B}_1\). First,
holds. Here, let \(\delta U = \begin{pmatrix}\delta U_{k,k}&{}\delta U_{k,d-k}\\ \delta U_{d-k,k}&{}\delta U_{d-k,d-k}\end{pmatrix}\) and we use the fact that \(\delta U\) is an alternating matrix to derive the above identity.
From that, by using Corollary 1, the second term is represented as follows:
Here, the second term of (9) implies the effect of mixing the first k components and the rest of the components, and then this is represented as follows:
Similarly to the third term of (8), this is positive when \(\tilde{{\varvec{a}}}_i{\varvec{a}}^*_i>0,\tilde{{\varvec{a}}}_j{\varvec{a}}^*_j>0\) for any i, j.
From the above, we obtain the desired result of the bias.
Next, as for the variance,
holds. Hence, we also obtain the results of the variance.
1.5 Proof of Theorem 2
The proof of Theorem 2 is clear if we show
for any \(i,j~(i\ne j)\).
First, the following holds:
Now, as for the second term, noting that X and \(\epsilon \) are independent,
holds. Similarly, the expectation of the third term is also zero. As for the fourth term, noting that each component of X and \({\varvec{\epsilon }}\) is all independent of each other,
holds. Here, (10) holds from the fact that \(\mathbb {E}_{X,\epsilon }[{\varvec{\epsilon }}_j{\varvec{\epsilon }}_{j'}]=\mathbb {E}_{X,\epsilon }[{\varvec{\epsilon }}_j]\mathbb {E}_{X,\epsilon }[{\varvec{\epsilon }}_{j'}]=0\) for \(j,~j'~(j\ne j')\). Finally, as for the first term,
holds. From the above, we have the desired result.
1.6 Proof of Corollary 3
When \({\varvec{a}}^*\) is sparse, \(\bar{B}_0-\bar{B}_1\) is represented as follows:
Hence, when \(k^*\le k\),
holds. Now, from \(\tilde{{\varvec{a}}}=(\Sigma +\lambda ^{(1)}I_d)^{-1}{X^\top {\varvec{y}}}/n\), noting that
we have
Next, we evaluate \(\mathbb {E}_{X,\epsilon }[(\tilde{{\varvec{a}}}_{i}-{\varvec{a}}^*_{i})^2]\). Noting that \({\varvec{a}}^*_i=0\) for \(i>k\), we get
Hence,
holds. Therefore,
holds. From the above, we obtain the desired result.
As for the variance, note that
for \(i>k\). Then, we have
1.7 Proof of Corollary 5
From the evaluation of series with the piecewise quadrature,
holds. And then, we get the result.
1.8 Proof of Theorem4
Similarly to the proof of Theorem 3, we can let \(W_1^\star =(\Sigma ^\frac{1}{2}W_1)^\dagger \Sigma ^\frac{1}{2}\). Then, similarly to the above, letting \(\tilde{U}\), \(\tilde{V}\), and \(\tilde{\Sigma }^\frac{1}{2}\) be left singular vectors, right singular vectors, and singular values of \(\Sigma ^\frac{1}{2}W_1\),
holds.
Here, we let \(W_1(\eta )=W_0+\eta \tilde{{\varvec{a}}}{\varvec{a}}^\top \) and \(\tilde{U}(\eta )\) be left singular vectors of \(\Sigma ^\frac{1}{2}W_1(\eta )\). Noting that \(\tilde{U}(\eta )\) are also eigenvectors of \(\Sigma ^\frac{1}{2}W_1(\eta )W_1(\eta )^\top \Sigma ^\frac{1}{2}/\eta ^2=\Sigma /\eta ^2+\Sigma ^{\frac{1}{2}}\tilde{{\varvec{a}}}\tilde{{\varvec{a}}}^\top \Sigma ^{\frac{1}{2}}\), \(\tilde{U}:=\lim _{\eta \rightarrow \infty }\tilde{U}(\eta )\) exists from Corollary 4.3 of Sun and Sun [17].
In this case, from \(\tilde{\lambda }_1\rightarrow \infty ~(\text{ as }~\eta \rightarrow \infty )\), \(\tilde{\lambda }_1^{-\frac{1}{2}}\tilde{{\varvec{u}}}_1^\top \Sigma ^\frac{1}{2}{\varvec{a}}^*\rightarrow 0.\). So, in the following, we analyze \(\tilde{{\varvec{u}}}_i^\top \Sigma ^\frac{1}{2}{\varvec{a}}^*\) for \(i>1\). First, we let
Next, we expand \(\tilde{{\varvec{u}}}_i\) with \(\{{\varvec{u}}_i\}_{i=1}^d\) which is eigenvectors of A: \(\tilde{{\varvec{u}}}_i = \sum _{j=1}^d d_{ij}{\varvec{u}}_j\). Then, \(\tilde{{\varvec{u}}}_i^\top \Sigma ^\frac{1}{2}{\varvec{a}}^*\rightarrow d_{i1} ~(\text{ as }~\eta \rightarrow \infty )\) holds. This is because \({\varvec{u}}_1\rightarrow \Sigma ^\frac{1}{2}{\varvec{a}}^*\) from the fact that \(A=\Sigma +\eta ^2c^2\Sigma ^\frac{1}{2}{\varvec{a}}^*(\Sigma ^\frac{1}{2}{\varvec{a}}^*)^\top \) and \(\Vert \Sigma ^\frac{1}{2}{\varvec{a}}^*\Vert ^2 = \Vert {\varvec{a}}^*\Vert _\Sigma ^2 = 1\). Thus, we analyze \(d_{i1}\) in the following.
Now, noting that \((A+B)\tilde{{\varvec{u}}}_i = \tilde{\lambda }_i \tilde{{\varvec{u}}}_i\), we have
Here, denote \(\lambda _j^{A}\) to be the j-th largest eigenvalues of A. Multiplying (12) by \({\varvec{u}}_1^\top \) from the left, we get
As for \({\varvec{u}}_1^\top B/\eta ^2\),
holds. Then, as for \(\lambda ^A_1\),
holds. Hence, \(\lambda _1^A/\eta ^2\rightarrow c^2\,(\text{ as }\,\eta \rightarrow \infty ).\) Furthermore, as shown later, \(\tilde{\lambda }_i/\eta ^2\rightarrow 0\,(\text{ as }\,\eta \rightarrow \infty ).\) holds. Therefore we obtain
Next, we derive the evaluation of eigenvalues. From Corollary 2, \(\lambda _i \le \tilde{\lambda }_i\) holds for any \(i\in [d]\). As for \(\sum _{i>k}\tilde{\lambda }_i\), \(\sum _{i>k}\tilde{\lambda }_i=\sum _{i=k+1}^{d}\tilde{\lambda }_i\) holds because \(\tilde{\lambda }_i=0\) for any \(i>d\). Here, noting that the trace is linear mapping: \(\text{ tr }(A+B)=\text{ tr }(A)+\text{ tr }(B)\),
olds. Furthermore, note that
From the above, we have
Lemma 2
Suppose \(A\in \mathbb {R}^{n \times n}\) is a real symmetric matrix and \(B\in \mathbb {R}^{n \times n}\) is a positive-semidefinite matrix. Then, for any \(i\in [n]\), \(\mu _i(A)\le \mu _i(A+B)\) holds.
Proof
Without loss of generality, we can assume that A is a diagonal matrix with eigenvalues arranged in descending order. Note that, for any real symmetric matrix C and \(i\in [n]\),
holds. Then,
holds. Here, define \(\mathbb {R}^{i}\oplus \{{\varvec{0}}_{n-i}\}:=\{{\varvec{v}}\in \mathbb {R}^n|{\varvec{v}}_{i+1}={\varvec{v}}_{i+2}=\cdots ={\varvec{v}}_n=0\}\). From the above, we have the desired result. \(\square \)
1.9 Proof of Corollary 6
Noting Theorem 1, the following holds from Theorem 7 of [19]. First, as for the case 1,
-
1.
If there exists some constants \(\tilde{C}_x\) that only depend on \(\sigma _x\) and \(n\tilde{\lambda }_{k+1}\ge \tilde{C}_x\sum _{i>k}\tilde{\lambda }_i\) for some \(k<n/c_x\), then for \(\lambda ^{(2)}=n\tilde{\lambda }_{k+1}\) and for any \(t\in (c_x,n/c_x)\), with probability at least \(1-26e^{-t/c_x}\)
$$\begin{aligned} \frac{\mathscr {B}}{D_x}\le \tilde{\lambda }_k^2\Vert V^\top _{0:k}W^\star {\varvec{a}}^*\Vert ^2_{\Lambda _{0:k}^{-1}}+\Vert V^\top _{k:\infty }W^\star {\varvec{a}}^*\Vert ^2_{\Lambda _{k:\infty }} \nonumber \end{aligned}$$holds. Then, from Corollary 1 and Theorem 4,
$$\begin{aligned} \tilde{\lambda }_k^2\Vert V^\top _{0:k}W^\star {\varvec{a}}^*\Vert ^2_{\Lambda _{0:k}^{-1}}+\Vert V^\top _{k:\infty }W^\star {\varvec{a}}^*\Vert ^2_{\Lambda _{k:\infty }}= & {} \tilde{\lambda }_k^2\Vert \tilde{U}^\top _{1:k}\Sigma ^\frac{1}{2}{\varvec{b}}^*\Vert ^2_{\Sigma ^{-2}_{1:k}}+\Vert \tilde{U}^\top _{k:\infty }\Sigma ^\frac{1}{2}{\varvec{b}}^*\Vert ^2 \nonumber \\\le & {} \Vert {\varvec{b}}^*\Vert ^2_{\Sigma }.\nonumber \end{aligned}$$Furthermore, from remarks of Theorem 4, noting that \(\tilde{\lambda }_i-\lambda _i=o(1)\), \(n{\lambda }_{k+1}\ge {C}_x\sum _{i>k}{\lambda }_i\) holds if \(n\tilde{\lambda }_{k+1}\ge \tilde{C}_x\sum _{i>k}\tilde{\lambda }_i\).
Similarly, as for the case 2,
-
2.
If \(\sum _{i>k}\lambda _i\ge c_xn\lambda _{k+1}\) for some \(k<n/c_x\), then for \(\xi >c_x\) and \(\lambda ^{(2)}=-\sum _{i>k}\lambda _i+\xi \left( n\lambda _1+\sqrt{n\sum _{i>k}\lambda _i^2}\right) \), for any \(t\in (c_x,n/c_x)\) with probability at least \(1-20e^{-t/c_x}\)
$$\begin{aligned} \frac{\mathscr {B}}{D_x}\le & {} \xi \left( \lambda _{k+1}^2+\frac{\sum _{i>k}\lambda _i^2}{n}\right) \Vert V^\top _{0:k}W^\star {\varvec{a}}^*\Vert ^2_{\Lambda _{0:k}^{-1}}+\Vert V^\top _{k:\infty }W^\star {\varvec{a}}^*\Vert ^2_{\Lambda _{k:\infty }} \nonumber \\= & {} \xi \left( \lambda _{k+1}^2+\frac{\sum _{i>k}\lambda _i^2}{n}\right) \Vert \tilde{U}^\top _{1:k}\Sigma ^\frac{1}{2}{\varvec{b}}^*\Vert ^2_{\Sigma ^{-2}_{1:k}}+\Vert \tilde{U}^\top _{k:\infty }\Sigma ^\frac{1}{2}{\varvec{b}}^*\Vert ^2 \nonumber \\\le & {} \xi \left( 1+\frac{\sum _{i>k}\lambda _i^2}{n\lambda _k^2}\right) \Vert {\varvec{b}}^*\Vert ^2_{\Sigma }+\Vert {\varvec{b}}^*\Vert ^2_{\Sigma }\nonumber \end{aligned}$$holds. Especially, if \(\lambda _i=i^{-(1+\alpha )}~(-0.5<\alpha <0)\), \(\frac{\sum _{i>k}\lambda _i^2}{n\lambda _k^2}\le \frac{k}{n}<1\) holds because \(\lambda ^2_i=i^{-(1+\beta )}~(\beta >0)\).
1.10 Proof of Theorem 7
Define \({\varvec{b}}^*={\varvec{b}}/\tilde{{\varvec{a}}}^\top \Sigma {\varvec{a}}^*\).
First, define orthogonal matrices \(P=({\varvec{p}}_1,\dots ,{\varvec{p}}_d)\in \mathbb {R}^d\) as follows:
Then,
Thus, for any \(i\in [d]\), the sum part of \({\varvec{p}}_i\) is bounded by \(\mathcal {O}(\Vert {\varvec{b}}^*\Vert _\Sigma ) \).
Now, the second term of \(P^\top \Sigma ^\frac{1}{2}W_1W_1^\top \Sigma ^\frac{1}{2}P=P^\top \Sigma P+\eta ^2P^\top \Sigma ^\frac{1}{2}\tilde{{\varvec{a}}}\tilde{{\varvec{a}}}^\top \Sigma ^\frac{1}{2} P\) is a matrix where only the (1,1) component is \(\eta ^2\Vert \tilde{{\varvec{a}}}\Vert ^2_\Sigma \) and the rest is zero. Furthermore, non-diagonal components are \(\mathcal {O}(\Vert {\varvec{b}}^*\Vert _\Sigma )\). Hence, \(P^\top \Sigma ^\frac{1}{2}W_1W_1^\top \Sigma ^\frac{1}{2}P\) is a matrix which is a diagonal matrix added a \(\mathcal {O}(\Vert {\varvec{b}}^*\Vert _\Sigma )\) matrix. Thus \(P^\top \Sigma ^\frac{1}{2}W_1W_1^\top \Sigma ^\frac{1}{2}P\) can be diagonalized by an orthogonal matrix Q where \(Q=I_d+\mathcal {O}(\Vert {\varvec{b}}^*\Vert _\Sigma )\) from Theorem 6.
Therefore, because \(\tilde{U}=PQ\) for left singular vectors of \(\Sigma ^\frac{1}{2}W_1\) :\(\tilde{U}\), we have desired result.
Additional experiments
This section contains numerical experiments on facts mentioned in the text that are not directly related to the results of this study.
We first see that \(\hat{{\varvec{a}}}_{\textrm{RF}}\rightarrow \hat{{\varvec{a}}}_{\textrm{LR}}~(N/d\rightarrow \infty )\) holds as noted in Sect. 3. Figure 9a–c show \(\Vert \hat{{\varvec{a}}}_{\textrm{RF}}-\hat{{\varvec{a}}}_{\textrm{LR}}\Vert ^2\) averaged over 100 times when N/d is increased from 1 to 100.
Although the value of \(\Vert \hat{{\varvec{a}}}_{\textrm{RF}}-\hat{{\varvec{a}}}_{\textrm{LR}}\Vert ^2\) is different in each graph, in any \({\varvec{a}}^*\), \(\Vert \hat{{\varvec{a}}}_{\textrm{RF}}-\hat{{\varvec{a}}}_{\textrm{LR}}\Vert ^2\) converges to 0 when N/d is increased.
We next see that when the decay rate of the eigenvalues is fast and \(k^*\) is small, the discrepancy between the estimator of the random feature model and that of the one-layer linear regression model is large. Figure 10 shows \(\mathscr {B}_{RF}/\mathscr {B}_{LR}\) with \(N/d=10\) fixed and \(\alpha \) and \(k^*\) changed respectively.
From Fig. 10, when \(\alpha \) is large and \(k^*\) is small, \(\mathscr {B}_{RF}/\mathscr {B}_{LR}\) is significantly large and \(\hat{{\varvec{a}}}_{\textrm{RF}}\) and \(\hat{{\varvec{a}}}_{\textrm{LR}}\) are highly divergent. This is because when \(\alpha \) is large and \(k^*\) is small, the bias of one-layer linear regression is very small.
We lastly present some comparisons of the experimental results with and without data splitting. Figure 11a, c shows the relative bias in settings where data are split, while Fig. 11b, d shows that in settings without data splitting.
Although each plot except blue plots shows qualitatively the same behavior, the bias values for the settings where data are split are uniformly about 0.3 smaller than that without data splitting. This means the bias improves more with feature learning in settings where the model is trained on split data. This is just because feature learning models use twice as much data for training as random feature models and linear regression models in such situations.
Perturbation theory of matrices
This section presents various theorems on perturbations of matrices that are necessary for the analysis in Sect. 4.
Theorem 5 characterizes the condition that a matrix’s eigenvalues respond in the first order when it is subjected to a first-order perturbation. This is not always trivial, as can be seen from the matrix below:
Theorem 5
Let \(A\in \mathbb {R}^{n\times n}\) be a diagonalizable square matrix and \(B \in \mathbb {R}^{n\times n}\) be any square matrix. In this case, for any real number \(\varepsilon \) sufficiently small, the eigenvalue \(\tilde{\lambda }(\varepsilon )\) of \(A+\varepsilon B\) is represented by \(\tilde{\lambda }(\varepsilon )=\lambda _i+c\varepsilon +\mathcal {O}(\varepsilon ^2)\), given a constant c and the eigenvalue \(\lambda _i\) of A.
Proof
Without loss of generality, let A be a diagonal matrix. So, define \(A:=\text{ diag }(A_1,A_2,\dots ,A_m)\) and \(A_i:=\lambda _iI_{d_i}\). Then, \(\text{ det }(xI-(A+\varepsilon B))=f(x)+g(x,\varepsilon )=:\tilde{f}(x,\varepsilon )\) holds with \(f(x):=\text{ det }(xI-A)\). Now, since \(g(x,\varepsilon )\) is obviously a polynomial of high n degree in x and \(\varepsilon \) and A is a diagonal matrix, all terms in g have \(\varepsilon \) factors.
We first prove the zeroth order expansion factor of \(\tilde{\lambda }(\varepsilon )\). That is, \(\lim _{\varepsilon \rightarrow 0}\tilde{\lambda }(\varepsilon )=\lambda _i\) holds for some \(i\in [m]\). Since \(\tilde{\lambda }(\varepsilon )\) is an eigenvalue of \(A+\varepsilon B\), \(\tilde{f}(\tilde{\lambda }(\varepsilon ),\varepsilon )=0\) holds. Therefore, noting that A is a diagonal matrix,
holds.
Now, note that \(\tilde{\lambda }(\varepsilon )\le \Vert A+\varepsilon B\Vert _{\text{ op }}\le \Vert A\Vert _{\text{ op }}+\Vert B\Vert _{\text{ op }}<\infty \) when \(\varepsilon \le 1\) and each term of \(g(x,\varepsilon )\) contains a factor of \(\varepsilon \). Then \(\lim _{\varepsilon \rightarrow 0} g(\tilde{\lambda }(\varepsilon ),\varepsilon )=0\) holds. Therefore, if \(\lim _{\varepsilon \rightarrow 0}\tilde{\lambda }(\varepsilon ) = \lambda _i\) does not hold for any \(i\in [m]\), then the left side of (14) is not zero in the limit of \(\varepsilon \rightarrow 0\), and this is a contradiction. Hence, \(\lim _{\varepsilon \rightarrow 0}\tilde{\lambda }(\varepsilon ) = \lambda _i\) holds for some \(i \in [m]\).
We next prove that there exists a first-order perturbative expansion of \(\tilde{\lambda }(\varepsilon )\). Hereafter, \(i\in [m]\) is assumed to satisfy \(\lim _{\varepsilon \rightarrow 0}\tilde{\lambda }(\varepsilon ) = \lambda _i\). This proof is almost clear by showing that the polynomial \(g(\lambda _i,\varepsilon )\) for \(\varepsilon \) contains a factor \(\varepsilon ^{d_i}\) in every term. Thus, we first prove this.
\(g(\lambda _i,\varepsilon )=f(\lambda _i)+g(\lambda _i,\varepsilon )=\text{ det }((\lambda _iI-A)-\varepsilon B)\), but the matrix \((\lambda _iI-A)-\varepsilon B)\) contains only \(\varepsilon B\) components in the \(d_i\) row where the diagonal matrix \(A_i\) is stored. Thus, all nonzero terms in the determinant \(\text{ det }((\lambda _iI-A)-\varepsilon B)\) involve a factor of \(\varepsilon ^{d_i}\).
Now, because \(\tilde{\lambda }(\varepsilon )\ne \lambda _j (i\ne j)\) for a sufficiently small \(\varepsilon \), we have
from (14).
Considering the limit of \(\varepsilon \rightarrow 0\) on the right-hand side, for the numerator \(\lim _{\varepsilon \rightarrow 0}g(\tilde{\lambda }(\varepsilon ),\varepsilon )/\varepsilon ^{d_i}=\lim _{\varepsilon \rightarrow 0}g(\lambda _i,\varepsilon )/\varepsilon ^{d_i} < \infty \) holds. The first equality is understood from the fact that each term of \(g(\tilde{\lambda }(\varepsilon ),\varepsilon )\) is a polynomial consisting of the product of \(\tilde{\lambda }(\varepsilon )\) and \(\varepsilon \), and the limit can be decomposed for the product. The second inequality is as noted that the polynomial \(g(\lambda _i,\varepsilon )\) in \(\varepsilon \) begins with the term \(\varepsilon ^{d_i}\). For the denominator limit, \(\lim _{\varepsilon \rightarrow 0}\prod _{j\ne i}(\tilde{\lambda }(\varepsilon )-\lambda _j)^{d_j}=\prod _{j\ne i}(\lambda _i-\lambda _j)^{d_j}\ne 0\) holds.
From the above, letting \(c=\lim _{\varepsilon \rightarrow 0}-(g(\tilde{\lambda }(\varepsilon ),\varepsilon )/\varepsilon ^{d_i})/{\prod _{j\ne i}(\tilde{\lambda }(\varepsilon )-\lambda _j)^{d_j}}<\infty \), we have \(\tilde{\lambda }(\varepsilon )=\lambda _i+c\varepsilon +\mathcal {O}(\varepsilon ^2)\). \(\square \)
From here, we show that the eigenvectors respond in first order when perturbed to a matrix whose eigenvalues are not degenerate. In general, when eigenvalues are degenerate, even if the eigenvalues are diagonalizable, there are degrees of freedom in how the eigenspace is grounded, making perturbation analysis of the eigenvectors more difficult.
Corollary 8
If the matrix \(A\in \mathbb {R}^{n\times n}\) has no degenerate eigenvalues, then for any sufficiently small real number \(\varepsilon \) and any matrix \(B \in \mathbb {R}^{n\times n}\), \(A+\varepsilon B\) has no degenerate eigenvalues.
Proof
Fix a sufficiently small \(\varepsilon \). Assume the existence of degenerate eigenvalues \(\tilde{\lambda }\) of \(A+\varepsilon B\). From Theorem 5, there exists \(i\in [n]\) and \(\tilde{\lambda }=\lambda _i+\mathcal {O}(\varepsilon )\) holds.
Now, from the assumption,
holds. On the other hand,
Therefore, \(A+\varepsilon B\) has no degenerate eigenvalues. \(\square \)
From Theorem 5 and Corollary 8, we have the following corollary.
Corollary 9
If the matrix \(A\in \mathbb {R}^{n\times n}\) has n distinct eigenvalues \(\{\lambda _1,\dots ,\lambda _n\}\), then for any sufficiently small real number \(\varepsilon \) and any square matrix \(B\in \mathbb {R}^{n\times n}\), the matrix \(A+\varepsilon B\) has n different eigenvalues \(\{\tilde{\lambda }_1,\dots ,\tilde{\lambda }_n\}\) and each is represented as \(\tilde{\lambda _i}=\lambda _i +c_i\varepsilon +\mathcal {O}(\varepsilon ^2)\) with constants \(c_1,\dots ,c_n\).
Lemma 3
Let \(A\in \mathbb {R}^{n\times n}\) be a matrix whose eigenvalues are not degenerate and \(U\in \mathbb {R}^{n\times n}\) be its basis matrix. That is, \(U^{-1}AU\) is diagonal matrix. Then, for any sufficiently small real number \(\varepsilon \) and any square matrix \(B\in \mathbb {R}^{n\times n}\), there exists a basis matrix \(\tilde{U}(\varepsilon )\) of \(A+\varepsilon B\) and \(\lim _{\varepsilon \rightarrow 0}\tilde{U}(\varepsilon )=U\) holds.
Proof
Define \(U = ({\varvec{u}}_1,\dots ,{\varvec{u}}_n)\) and \(\tilde{U}(\varepsilon )=\left( \tilde{{\varvec{u}}}_1(\varepsilon ),\dots ,\tilde{{\varvec{u}}}_n(\varepsilon )\right) \). The existence of \(\tilde{U}(\varepsilon )\) is clear from Corollary 9. Then, \((\tilde{\lambda }_i(\varepsilon )I-(A+\varepsilon B))\tilde{{\varvec{u}}}_i(\varepsilon )={\varvec{0}}\) holds for each \(i\in [n]\) with any small \(\varepsilon \). Now, assume that \(\lim _{\varepsilon \rightarrow 0}\tilde{{\varvec{u}}}(\varepsilon ) = {\varvec{v}}\ne {\varvec{u}}_i\). Then, we have
However, each eigenspace is one-dimensional, since any eigenvalue of A is not degenerate. Therefore, \((\lambda _i I-A){\varvec{v}}\ne {\varvec{0}}\) from \((\lambda _i I-A){\varvec{u}}_i={\varvec{0}},{\varvec{v}}\ne {\varvec{u}}_i\). This is a contradiction. Then, eigenvectors are continuous. \(\square \)
Theorem 6 shows that the eigenspace has a first-order response for perturbation when the eigenvectors form an orthonormal basis. Furthermore, we give explicit form for first-order perturbations of eigenvalues and eigenvectors.
Theorem 6
Let \(A\in \mathbb {R}^{n \times n}\) be a normal matrix whose eigenvalues are not degenerate and \(B\in \mathbb {R}^{n \times n}\) be any square matrix. Let \(U=({\varvec{u}}_1,\dots ,{\varvec{u}}_n)\) be eigenvectors and \({\lambda _1,\dots ,\lambda _n}\) be eigenvalues of A, then for a sufficiently small real number \(\varepsilon \), \((A+\varepsilon B)\)’s eigenvectors \(\tilde{U}=\left( \tilde{{\varvec{u}}}_1(\varepsilon ),\dots ,\tilde{{\varvec{u}}}_n(\varepsilon )\right) \) and eigenvalues \(\{\tilde{\lambda }_1(\varepsilon ),\dots ,\tilde{\lambda }_n(\varepsilon )\}\) are given by
Proof
Note that since A is normal, U is an orthogonal matrix, i.e. \(U^\top U=I\). If we represent \(\tilde{\lambda }_i(\varepsilon )=\lambda _i+\delta \lambda _i(\varepsilon )\) and \(\tilde{{\varvec{u}}}_i(\varepsilon )={\varvec{u}}_i+{\varvec{\delta u}}_i(\varepsilon )\), then we have
If we represent \(\tilde{\lambda }_i(\varepsilon )=\lambda _i+\delta \lambda _i(\varepsilon )\) and \(\tilde{{\varvec{u}}}_i(\varepsilon )={\varvec{u}}_i+{\varvec{\delta u}}_i(\varepsilon )\), then we have
From Lemma 3, \(\lim _{\varepsilon \rightarrow 0}c_{ij}(\varepsilon )=0\). Then, since \(1+c_{ii}(\varepsilon )\ne 0\) for sufficiently small \(\varepsilon \), letting \(d_{ij}(\varepsilon )=c_{ij}(\varepsilon )/(1+c_{ii}(\varepsilon ))\), we have
Then, we have
Thus, noting \(\varepsilon d_{ij}(\varepsilon )=o(\varepsilon )\) and multiplying both sides by \({\varvec{u}}_i^\top \) from the left, we get
Then multiplying by \({\varvec{u}}_j^\top (i\ne j)\) from the left, we get
\(\square \)
Corollary 10
In particular, when A is a symmetric matrix, the eigenvalues and eigenvectors of \(A+\varepsilon B\) are given in the above form.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Nishimori, H., Suzuki, T. Feature learning and generalization error analysis of two-layer linear neural networks for high-dimensional inputs. Info. Geo. (2024). https://doi.org/10.1007/s41884-024-00142-3
Received:
Revised:
Accepted:
Published:
DOI: https://doi.org/10.1007/s41884-024-00142-3