1 Introduction

In classical learning theory, it is usually recommended not to overfit the training data because overfitting leads to poor predictive performance. Accordingly, regularization or variable selection techniques are usually applied to restrict the degrees of freedom of the model.

However, it has been reported experimentally for various models that overparametrized models generalize very well even when no regularization is applied [5, 16, 21]. This phenomenon is called benign overfitting and has been the subject of significant research in recent years. Theoretical analyses of benign overfitting have been carried out by Bartlett et al. [4, 19], Hastie et al. [10], and Xu and Hsu [20] for the one-layer linear regression model, and by Li et al. [15], Louart et al. [14], and Mei and Montanari [13] for the random feature model of the two-layer nonlinear neural network.

Meanwhile, it has been reported in various situations that “deep” learning models such as two-layer neural networks outperform “shallow” learning models such as kernel methods under appropriate model sizes [11, 18]. This is considered to be because deep learning models can adaptively learn basis functions, which is known as feature learning. On the other hand, deep learning models show good generalization performance in practice even without choosing an appropriate model size. In particular, the model size in deep learning is often larger than the sample size.

Ba et al. [3] gives an analysis of the predictive error of the estimator after one-step feature learning of a two-layer neural network, and shows that the model with feature learning generalizes beyond the lower bound of the predictive error of linear models such as kernel methods, especially when the sample size is larger than the model size. However, the analysis is given in the proportional limit, and a finite-dimensional analysis has not been carried out. In another study, Damian et al. [8] analyzed the learning ability of a one-step feature learning model in the setting where the output depends only on a low-dimensional subspace of dimension r in the d-dimensional input space. According to Damian et al. [8], one-step gradient feature learning makes the model generalize even when \(n \asymp d^2 r + d r^p\), while random feature models require the larger sample size \(n \gtrsim d^p\), where p is the degree of the polynomial to be estimated. However, the effect of feature learning in the situation where \(n<d\) has not been reported. Therefore, it is an important question whether a model with feature learning achieves benign overfitting when the sample size is smaller than the input dimension.

In this study, we investigate three different feature learning procedures and their effect on the predictive performance in the interpolation regime, where the model is a two-layer linear neural network. Specifically, the three procedures we investigate are (i) training the first layer by one-step vanilla gradient descent, (ii) training it by natural gradient descent, and (iii) training it by linear regression; on top of that, we obtain the minimum norm interpolator using the second layer. The predictive performances of the three procedures are analyzed in the finite-dimensional overparametrized regime, where the dimension of the input data is larger than the sample size, and we then compare the performances of the three procedures to reveal what kind of feature learning method is effective for benign overfitting. In summary, we provide the following contributions in this paper:

First, we analyze the situation where the step size for feature learning is small. We give an explicit formula for the difference between the predictive errors with and without feature learning, and characterize when feature learning can reduce the predictive error. We also confirm that our theoretical evaluation is consistent with numerical experiments. As a consequence, we see that, in “difficult” instances where the true signal is not well aligned with the input, NGD can reduce the bias while GD cannot. This indicates the effectiveness of NGD in the interpolation regime.

Second, we evaluate the generalization error when a larger step size is employed for feature learning. We see that the generalization error after feature learning does not deteriorate compared with that of the non-feature-learned linear interpolator. We also give a detailed theoretical analysis that characterizes the differences in predictive error among the three approaches.

In the former setting, we find that the reduction of the predictive error induced by feature learning becomes larger as the eigenvalues of the covariance of the input decay more slowly, i.e., as the regression problem becomes more difficult. Although it is known that when the eigenvalues decay polynomially with \(\lambda _i=i^{-(1+\alpha )}~(\alpha >0)\) the optimal minimax rate can be achieved by one-layer linear regression, we show theoretically and experimentally that feature learning significantly reduces the bias while leaving the variance essentially unchanged when \(\alpha <0\).

Amari et al. [1], which studied the predictive error of linear interpolators with preconditioning, reported a tradeoff in which preconditioning reduces the bias while increasing the variance. In this study, by contrast, we show that only the bias decreases and the variance hardly increases. This significant difference arises because we deal with a linear interpolator computed after feature learning, whereas Amari et al. analyzed only a linear interpolator without feature learning. In addition, Ba et al. [3] analyzed the predictive error after one gradient step of feature learning; however, they considered the proportional limit where n and d tend to \(\infty \) with their ratio converging to a positive constant (\(n \asymp d\)). Our analysis, on the other hand, does not take the proportional limit and is valid in a general finite sample and finite dimensional setting. As a consequence, the reduction of the predictive error can be shown even in the setting \(d = \omega (n)\).

1.1 Notations

Throughout this paper, for a vector \({\varvec{v}}\in \mathbb {R}^{n}\) and a positive definite symmetric matrix \(\Sigma \in \mathbb {R}^{n \times n}\), we let \(\Vert {\varvec{v}}\Vert ^2_\Sigma :={\varvec{v}}^\top \Sigma {\varvec{v}}\) and \(\Vert {\varvec{v}}\Vert _\Sigma :=\sqrt{\Vert {\varvec{v}}\Vert ^2_\Sigma }\). For a square matrix \(A\in \mathbb {R}^{n\times n}\), its operator norm is denoted by \(\Vert A\Vert _{\textrm{op}}\) and the i-th largest eigenvalue of A, allowing for ties, is denoted by \(\mu _i(A)\). Also, for a matrix \(B\in \mathbb {R}^{n\times m}\), let \(B_{0:k}\in \mathbb {R}^{n\times k}\) denote the submatrix of B consisting of its left k columns and \(B_{k:\infty }\in \mathbb {R}^{n\times (m-k)}\) denote the submatrix of B consisting of the remaining columns. Similarly, for \({\varvec{v}}\in \mathbb {R}^n\), let \({\varvec{v}}_{0:k}\in \mathbb {R}^k\) be the vector consisting of the first k components of \({\varvec{v}}\) and \({\varvec{v}}_{k:\infty }\in \mathbb {R}^{n-k}\) be the vector consisting of the remaining components. Let \(\mathcal {O}(\cdot )\) and \(o(\cdot )\) denote Landau's big-O and little-o asymptotic notations, respectively; when these notations are used for fixed finite-dimensional matrices and vectors, we assume that the asymptotic property holds for each component. In addition, for a sequence of random variables \(X_n\), \(X_n = o_\mathbb {P}(1)\overset{\textrm{def}}{\iff }\) for every \(\varepsilon >0\), \(\lim _{n\rightarrow \infty }P(|X_n|\le \varepsilon )=1\), and \(X_n = o_\mathbb {P}(f(n))\overset{\textrm{def}}{\iff }\) \(X_n/f(n)=o_\mathbb {P}(1)\). Similarly, \(X_n = \mathcal {O}_\mathbb {P}(1)\overset{\textrm{def}}{\iff }\) for every \(\varepsilon >0\), there exist a constant \(K(\varepsilon )\) and an integer \(N(\varepsilon )\) such that if \(n\ge N(\varepsilon )\) then \(P(|X_n|\le K(\varepsilon ))\ge 1-\varepsilon \), and \(X_n = \mathcal {O}_\mathbb {P}(f(n))\overset{\textrm{def}}{\iff }\) \(X_n/f(n)=\mathcal {O}_\mathbb {P}(1)\).

1.2 Related work

Benign overfitting of linear estimators. Recent works provided analyses of linear models under high-dimensional input [1, 4, 10, 19, 20]. In particular, feature learning in the two-layer linear neural network can be regarded as a kind of the preconditioning analyzed in Amari et al. [1]. However, this result cannot be applied to this study because it considered only diagonalizable preconditioning. Benign overfitting was also analyzed in the random feature model of the two-layer nonlinear neural network [13,14,15].

All of these estimators are linear estimators, that is, they are linear in the output data. However, the estimator we consider in this study is not linear, because feature learning in the first layer induces nonlinearity with respect to the output data.

Two-layer neural network model. The two-layer neural network with feature learning is one of the simplest deep learning models. It is known theoretically that appropriately sized deep learning estimators outperform linear estimators for certain classes of functions [11, 18]. However, deep learning estimators perform well even when an appropriate model size is not chosen, and several works have analyzed such situations [2, 3, 6, 8, 9]. In particular, it has been pointed out that the transition trajectories of the parameters in the learning process depend heavily on the initial stage of learning [12], and Ba et al. [3] showed that the estimator with one-step feature learning outperforms linear estimators when the sample size is larger than the input dimension. In contrast, we consider estimators with one-step feature learning when the input dimension is larger than the sample size. In addition, this study performs its analysis with finite data, while Ba et al. [3] worked in the proportional limit. The importance of analysis with finite data has been pointed out in detail by Cheng and Montanari [7].

1.3 Paper organization

Section 2 describes the learning model and the input–output data setup that are the subject of theoretical analysis in this study, and provides an interpretation of the proposed model. In Sect. 3, we give an analysis of the predictive errors of these learning methods, together with brief remarks and a discussion of the results. In Sect. 4, we perform numerical experiments related to the main results of Sect. 3 and discuss them. Section 5 summarizes this study and its results, and mentions future work. The proofs of the analyses are given in Sects. 6 and 7. Section 8 contains numerical experiments on facts not directly related to the analytical results of this study. Section 9 summarizes the perturbation theory of matrices necessary for the analysis.

2 Problem setup and training procedure

The goal of this study is to construct an estimator based on training data and theoretically evaluate the predictive error of this estimator. In Sect. 2.1, we describe the problem setting and assumptions on the underlying distribution in this study, which is consistent with Tsigler and Bartlett [19]. In Sect. 2.2, we explain the two-layer linear network we consider in this study and introduce the feature learning procedure.

2.1 Setting of input and output data

  1.

    Suppose that we observe an i.i.d. sequence of inputs \(({\varvec{x}}_i)_{i=1}^n \in \mathbb {R}^d\) whose mean is zero and whose covariance matrix is \(\Sigma \). Without loss of generality, \(\Sigma \) can be transformed into a diagonal matrix with eigenvalues arranged in descending order by an appropriate change of basis, so we assume that \(\Sigma = \text{ diag }(\lambda _1, \lambda _2, \dots ,\lambda _d) \) with \(\lambda _1 \ge \lambda _2\ge \cdots \ge \lambda _d\). In addition, we assume that the random variable \(\Sigma ^{-\frac{1}{2}}{\varvec{x}}_i\) follows a sub-Gaussian distribution with sub-Gaussian norm \(\sigma _x\).

  2.

    The output data \((y_i)_{i=1}^n\in \mathbb {R}\) are assumed to be observed as the value of a linear function \(f^*({\varvec{x}}_i)={\varvec{x}}_i^\top {\varvec{a}}^*\) with additive observation noise:

    $$\begin{aligned} y_i={\varvec{x}}_i^\top {\varvec{a}}^*+\epsilon _i\quad (i=1,\dots ,n). \nonumber \end{aligned}$$

    Here, \((\epsilon _i)_{i=1}^n\) follow a Gaussian distribution with mean 0 and variance \(\sigma _\epsilon ^2\) independently: \(\epsilon _i \overset{\mathrm {i.i.d.}}{\sim }\mathcal {N}(0,\sigma _\epsilon ^2)\). Furthermore, we assume \(\Vert {\varvec{a}}^*\Vert ^2_\Sigma =1\) without loss of generality.

In this study, we construct an estimator of a linear model \(\hat{f}({\varvec{x}})={\varvec{x}}^\top \hat{{\varvec{a}}}\) from the training data \(({\varvec{x}}_i,y_i)_{i=1}^n\) satisfying the above conditions and evaluate the predictive error measured by the \(L^2\)-norm, \(\mathbb {E}_{{\varvec{x}}}[(f^*({\varvec{x}})-\hat{f}({\varvec{x}}))^2]\), where the expectation is taken over an unseen test input \({\varvec{x}}\) independent of the training data. Note that the predictive error can be expressed as follows:

$$\begin{aligned} \mathbb {E}_{{\varvec{x}}}\left[ \left( f^*({\varvec{x}})-\hat{f}({\varvec{x}})\right) ^2\right]= & {} \mathbb {E}_{{\varvec{x}}}\left[ \Vert {\varvec{x}}^\top ({\varvec{a}}^*-\hat{{\varvec{a}}})\Vert ^2\right] \nonumber \\= & {} \mathbb {E}_{{\varvec{x}}}\left[ ({\varvec{a}}^*-\hat{{\varvec{a}}})^\top {\varvec{x}}{\varvec{x}}^\top ({\varvec{a}}^*-\hat{{\varvec{a}}}) \right] \nonumber \\= & {} \Vert {\varvec{a}}^*-\hat{{\varvec{a}}}\Vert ^2_{\Sigma } \end{aligned}$$
(1)
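To make the setup concrete, the following is a minimal NumPy sketch (ours, not part of the original analysis) of the data-generating process above and of the predictive error (1). The spectrum \(\lambda _i=i^{-(1+\alpha )}\), the sparse signal, and the parameter values are illustrative and follow the experimental setting of Sect. 4; the minimum norm interpolator appears only as an example estimator.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, alpha, sigma_eps = 100, 200, -0.5, 0.1   # sigma_eps^2 = 0.01 as in Sect. 4

# Diagonal covariance with polynomially decaying eigenvalues (as in Sect. 4).
lam = np.arange(1, d + 1, dtype=float) ** (-(1 + alpha))

# Sparse true signal a* with ||a*||_Sigma^2 = 1 (the "sparse signal" setting, k* = 100).
k_star = 100
a_star = np.zeros(d)
a_star[k_star - 1] = 1.0 / np.sqrt(lam[k_star - 1])

# Training data: x_i ~ N(0, Sigma), y_i = x_i^T a* + eps_i.
X = rng.standard_normal((n, d)) * np.sqrt(lam)
y = X @ a_star + sigma_eps * rng.standard_normal(n)

def predictive_error(a_hat):
    """Predictive error (1): ||a* - a_hat||_Sigma^2."""
    diff = a_star - a_hat
    return float(diff @ (lam * diff))

# Example: the minimum norm interpolator of the one-layer linear model.
a_mn = X.T @ np.linalg.solve(X @ X.T, y)
print(predictive_error(a_mn))
```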

2.2 Training procedure

In this study, we consider a fully-connected two-layer linear neural network with N neurons. The input and output of the network are represented as follows:

$$\begin{aligned} f({\varvec{x}}) = {\varvec{x}}^\top W {\varvec{a}}\quad (W\in \mathbb {R}^{d \times N},{\varvec{a}}\in \mathbb {R}^N).\nonumber \end{aligned}$$

Next, we describe how to train the parameters \(W,{\varvec{a}}\) of this model.

  1.

    First, we randomly initialize the parameters \(W = W_0\) and \({\varvec{a}} = {\varvec{a}}_0\) as i.i.d. realizations of Gaussian random variables: \(\sqrt{N} ~W_{ij}\sim \mathcal {N}(0,1),~\sqrt{N} ~a_j \sim \mathcal {N}(0,1)~(i \in [d],~j\in [N])\).

  2.

    Second, the first-layer parameter \(W_0\) is updated by one-step gradient descent to reduce the training error \(\mathcal {L}=\frac{1}{n}\sum _{i=1}^n\left( y_i-f({\varvec{x}}_i)\right) ^2\). More precisely, it is updated as

    $$\begin{aligned} W_1 = W_0 - \eta P(\lambda ^{(1)}) \nabla _{W_0}\mathcal {L}.\nonumber \end{aligned}$$

    Here, \(P(\lambda ^{(1)})\) is a \(d\times d \) matrix that depends on the regularization parameter \(\lambda ^{(1)}\ge 0\), and \(\eta \ge 0\) represents the step size.

  3.

    Third, \({\varvec{a}}_0\) is updated by performing ridge regression with an independent realization of the training data \((\tilde{{\varvec{x}}}_i,\tilde{y}_i)_{i=1}^n\), which has the same distribution as the data used to train the first layer:

    $$\begin{aligned} {\varvec{a}}_1 = \text{ argmin}_{{\varvec{a}}\in \mathbb {R}^N}\, \left\{ \sum _{i=1}^n(\tilde{y}_i-\tilde{{\varvec{x}}}_i^\top W_1 {\varvec{a}})^2+\lambda ^{(2)}\Vert {\varvec{a}}\Vert ^2\right\} .\nonumber \end{aligned}$$

Here, we give some remarks on the training procedure described above. In the following, we write \(X := ({\varvec{x}}_1,\dots ,{\varvec{x}}_n)^\top ,~\tilde{X} := (\tilde{{\varvec{x}}}_1,\dots ,\tilde{{\varvec{x}}}_n)^\top ,~{\varvec{y}} = (y_1,\dots ,y_n)^\top ,~\tilde{{\varvec{y}}}=(\tilde{y}_1,\dots ,\tilde{y}_n)^\top \). We first note that the gradient \(\nabla _{W_0}\mathcal {L}\) in the update equation of the first layer is represented as

$$\begin{aligned} \nabla _{W_0}\mathcal {L} =-\frac{1}{n}X^\top \left( {\varvec{y}}-\frac{1}{\sqrt{N}}XW_0{\varvec{a}}\right) {\varvec{a}}^\top . \end{aligned}$$
(2)

This shows that the update from \(W_0\) to \(W_1\) is given by a rank-1 matrix. Therefore, we may write

$$\begin{aligned} W_1 = W_0 + \eta \tilde{{\varvec{a}}}{\varvec{a}}^\top \end{aligned}$$
(3)

where \(\tilde{{\varvec{a}}}\) is given by

$$\begin{aligned} \tilde{{\varvec{a}}} = P(\lambda ^{(1)})\frac{1}{n}X^\top \left( {\varvec{y}}-\frac{1}{\sqrt{N}}XW_0{\varvec{a}}\right) . \end{aligned}$$
(4)

In this study, we primarily analyze the situation where the number of neurons N is infinitely large while keeping d and n finite. In this case, \(\tilde{{\varvec{a}}}\overset{a.s.}{\rightarrow }P(\lambda ^{(1)})\frac{1}{n}X^\top {\varvec{y}}~( N\rightarrow \infty )\) because the second term in (4) converges to zero.

Next, we analyze three choices of the matrix \(P(\lambda ^{(1)})\):

$$\begin{aligned} P_1(\lambda ^{(1)})=I_d,~ P_2(\lambda ^{(1)})=(\Sigma +\lambda ^{(1)}I_d)^{-1},~ P_3(\lambda ^{(1)})=\left( \frac{1}{n}(X^\top X+\lambda ^{(1)}I_d)\right) ^{-1}. \end{aligned}$$

Here, \(\Sigma \) is not necessarily non-singular, and \(X^\top X\) is singular because \(X\in \mathbb {R}^{n\times d}\) with \(n<d\). Hence, if \(\lambda ^{(1)} = 0\), \((\Sigma +\lambda ^{(1)}I_d)^{-1}\) and \(\left( \frac{1}{n}(X^\top X+\lambda ^{(1)}I_d)\right) ^{-1}\) do not exist. However, even in this case, \(\lim _{\lambda ^{(1)}\rightarrow 0}P_i(\lambda ^{(1)})\frac{1}{n}X^\top \left( {\varvec{y}}-\frac{1}{\sqrt{N}}XW_0{\varvec{a}}\right) ~(i=2,3)\) exists. Thus, we define \(P_i(0):=\lim _{\lambda ^{(1)}\rightarrow 0}P_i(\lambda ^{(1)})~(i=2,3)\) for convenience.
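As an illustration, here is a short NumPy sketch (ours) of the three update directions in the \(N\rightarrow \infty \) limit, where \(\tilde{{\varvec{a}}}=P(\lambda ^{(1)})\frac{1}{n}X^\top {\varvec{y}}\). The pseudoinverse implements the \(\lambda ^{(1)}\rightarrow 0\) convention for \(P_2\) and \(P_3\), and the data-generating parameters are illustrative values taken from Sect. 4.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, alpha, lam1 = 100, 200, -0.5, 0.0   # lam1 is the regularization parameter lambda^(1)
lam = np.arange(1, d + 1, dtype=float) ** (-(1 + alpha))
a_star = np.zeros(d); a_star[99] = lam[99] ** -0.5
X = rng.standard_normal((n, d)) * np.sqrt(lam)
y = X @ a_star + 0.1 * rng.standard_normal(n)

g = X.T @ y / n   # N -> infinity limit of (1/n) X^T (y - X W_0 a / sqrt(N))

a_gd    = g                                                      # P_1: vanilla GD direction
a_ngd   = np.linalg.pinv(np.diag(lam) + lam1 * np.eye(d)) @ g    # P_2: NGD/GD mixture (pure NGD for lam1 = 0)
a_ridge = np.linalg.pinv((X.T @ X + lam1 * np.eye(d)) / n) @ g   # P_3: ridge-regression direction

# The NGD direction estimates a* without bias on average; compare e.g. its k*-th entry:
print(a_ngd[99], a_star[99])
```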

The following is an interpretation of each \(P(\lambda ^{(1)})\).

  1.

    When \(P = P_1\), the update of W corresponds to vanilla gradient descent.

  2.

    When we use \(P = P_2\), the update of \(W_0\) can be seen as a mixture of the natural gradient descent (NGD) and the gradient descent (GD). Indeed, if \(\lambda ^{(1)} =0\), it is exactly NGD, and as \(\lambda ^{(1)}\) goes to \(\infty \), it approaches GD. In that sense, \(\lambda ^{(1)}\) is a parameter that controls the mixture proportion of NGD and GD. The difference between NGD and GD is crucial especially when we consider the one-step update. Indeed, noticing that \(\mathbb {E}_{X,\epsilon }[\frac{1}{n}X^\top {\varvec{y}}]=\Sigma {\varvec{a}}^*\) since \(\mathbb {E}_\epsilon [{\varvec{y}}]=X{\varvec{a}}^*\), the update direction of NGD \((\lambda ^{(1)} = 0)\) coincides on average with the true signal \({\varvec{a}}^*\), which means that NGD can efficiently capture the true signal in expectation. On the other hand, the inverses of the small eigenvalues of \(\Sigma \) are large, so the noise component in \(\frac{1}{n}X^\top {\varvec{y}}\) is amplified by \(\Sigma ^{-1}\). Therefore, \(\lambda ^{(1)}\) can be seen as a regularizer that truncates the small eigenvalues of \(\Sigma \) at the value \(\lambda ^{(1)}\) and suppresses the amplification of noise.

  3.

    Third, since

    $$\begin{aligned} \text{ argmin}_{\bar{{\varvec{a}}}\in \mathbb {R}^d} \{\Vert X\bar{{\varvec{a}}}-{\varvec{y}}\Vert ^2+\lambda ^{(1)}\Vert \bar{{\varvec{a}}}\Vert ^2\}= & {} \left( \frac{1}{n}(X^\top X+\lambda ^{(1)}I_d)\right) ^{-1}\frac{1}{n}X^\top {\varvec{y}} \nonumber \\= & {} X^\top (XX^\top +\lambda ^{(1)}I_n)^{-1}{\varvec{y}}, \end{aligned}$$
    (5)

    we can see that, when \(P(\lambda ^{(1)})=P_3(\lambda ^{(1)})\), the update direction is that of ridge regression. The derivation of (5) is described in Sect. 6; a quick numerical check is given below.
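The identity (5) is easy to verify numerically; the following short snippet (with arbitrary synthetic data) compares the primal ridge solution with the dual form used above.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, lam1 = 100, 200, 0.3
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

lhs = np.linalg.solve(X.T @ X + lam1 * np.eye(d), X.T @ y)   # (X^T X + lam I_d)^{-1} X^T y
rhs = X.T @ np.linalg.solve(X @ X.T + lam1 * np.eye(n), y)   # X^T (X X^T + lam I_n)^{-1} y
print(np.allclose(lhs, rhs))                                 # True
```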

Next, we discuss the optimization of the second layer. This is a one-layer linear ridge regression on the data \(XW_1\) passing through the first layer. However, as we have already seen, \(W_1\) and X are not independent. In our analysis, we therefore optimize the second layer \({\varvec{a}}_1\) based on data \(\tilde{X}\) of the same size sampled independently of X, so that the stochastic behavior of X and \(\tilde{X}\) can be handled separately in the two stages. This is an approximate setup introduced for the analysis, but the qualitative behavior is consistent with that of actual models trained on unsplit data (Fig. 11). In the situation where the data are split, as in (5), \({\varvec{a}}_1\) can be written explicitly in terms of \(W_1\) and \(\tilde{X}\):

$$\begin{aligned} {\varvec{a}}_1 = W_1^\top \tilde{X}^\top (\tilde{X}W_1 W_1^\top \tilde{X}^\top +\lambda ^{(2)} I_n)^{-1}\tilde{{\varvec{y}}}. \nonumber \end{aligned}$$

Therefore, the estimator of the two-layer linear model in this study is given by \(\hat{{\varvec{a}}}=W_1{\varvec{a}}_1 = W_1 W_1^\top \tilde{X}^\top (\tilde{X}W_1 W_1^\top \tilde{X}^\top +\lambda ^{(2)} I_n)^{-1}\tilde{{\varvec{y}}}\). This estimator \(\hat{{\varvec{a}}}\) can be regarded as a kind of preconditioned estimator analyzed in Amari et al. [1]. In general, a preconditioned estimator is given by

$$\begin{aligned} PX^\top (XPX^\top )^{-1}{\varvec{y}} \end{aligned}$$
(6)

with a matrix P. Therefore, \(\hat{{\varvec{a}}}\) is a variant of a preconditioned estimator with \(P=W_1 W_1^\top \) in (6). In particular, since \(W_0 W_0^\top \overset{a.s.}{\rightarrow }I_d~( N/d \rightarrow \infty )\), the random feature model coincides with the one-layer linear regression model when N is sufficiently large (Fig. 9). As we have mentioned above, we consider the setting where N is taken to \(\infty \) so that \(W_0W_0^\top =I_d\), because our goal is to compare the predictive error of the estimator obtained by the one-step feature learning model with that obtained by the one-layer linear regression model.
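To summarize the whole procedure, the following is a minimal end-to-end sketch (ours) of the training pipeline under the resampling setup of this section: random initialization, one-step feature learning with the NGD-type preconditioner \(P_2\), and ridge regression on the second layer. The parameter values are illustrative (those of Sect. 4), and the finite N used here only approximates the \(N\rightarrow \infty \) regime assumed in the analysis.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, N = 100, 200, 1000
alpha, eta, lam1, lam2 = -0.5, 1.0, 0.0, 0.0
lam = np.arange(1, d + 1, dtype=float) ** (-(1 + alpha))
a_star = np.zeros(d); a_star[99] = lam[99] ** -0.5

def sample(rng):
    X = rng.standard_normal((n, d)) * np.sqrt(lam)
    return X, X @ a_star + 0.1 * rng.standard_normal(n)

X, y = sample(rng)        # data for the first-layer update
Xt, yt = sample(rng)      # independently resampled data for the second layer

# Step 1: random initialization, W_ij ~ N(0, 1/N), a_j ~ N(0, 1/N).
W0 = rng.standard_normal((d, N)) / np.sqrt(N)
a0 = rng.standard_normal(N) / np.sqrt(N)

# Step 2: one-step preconditioned update of the first layer, Eqs. (3)-(4), with P = P_2.
P = np.linalg.pinv(np.diag(lam) + lam1 * np.eye(d))
a_tilde = P @ (X.T @ (y - X @ W0 @ a0 / np.sqrt(N))) / n
W1 = W0 + eta * np.outer(a_tilde, a0)

# Step 3: second-layer ridge regression on the resampled data, giving
# a_hat = W1 W1^T Xt^T (Xt W1 W1^T Xt^T + lam2 I_n)^{-1} yt.
K = Xt @ W1 @ W1.T @ Xt.T
a_hat = W1 @ W1.T @ Xt.T @ np.linalg.solve(K + lam2 * np.eye(n), yt)

# Predictive error (1).
diff = a_star - a_hat
print(float(diff @ (lam * diff)))
```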

3 Main results

In this section, we first give the Bias-Variance decomposition of the predictive error and derive upper bounds on the Bias and Variance for a general step size. We then explicitly compare the predictive errors of a feature learning method and a non-feature learning method in two different regimes of the step size: the small step size setting and the large step size setting.

The predictive error given in (1) is decomposed into Bias and Variance as

$$\begin{aligned} \Vert {\varvec{a}}^{*}-\hat{{\varvec{a}}}\Vert ^{2}_\Sigma= & {} \Vert {\varvec{a}}^{*} -W_{1} W_{1}^{\top } \tilde{X}^{\top } (\tilde{X}W_{1} W_{1}^{\top } \tilde{X}^{\top } +\lambda ^{(2)} I_{n})^{-1}\tilde{{\varvec{y}}} \Vert ^{2}_\Sigma \nonumber \\\lesssim & {} \Vert {\varvec{a}}^{*} -W_{1} W_{1}^{\top } \tilde{X}^{\top } (\tilde{X}W_{1} W_{1}^{\top } \tilde{X}^{\top } +\lambda ^{(2)} I_{n})^{-1}\tilde{X}{\varvec{a}}^{*} \Vert ^{2}_\Sigma \nonumber \\{} & {} + \Vert W_{1} W_{1}^{\top } \tilde{X}^{\top } (\tilde{X}W_{1} W_{1}^{\top } \tilde{X}^{\top } +\lambda ^{(2)} I_{n})^{-1}\tilde{{\varvec{\epsilon }}} \Vert ^{2}_{\Sigma } \nonumber \\=: & {} \mathscr {B} + \mathscr {V}. \end{aligned}$$
(7)

The following theorem enables us to reduce the evaluation of the Bias and Variance of the two-layer neural network to that of the vanilla (one-layer) ridge regression.

Theorem 1

Denote the right singular vectors of \(\Sigma ^\frac{1}{2}W_1\) by \(\tilde{V}\in \mathbb {R}^{N\times N}\). Here, \(\Lambda := \tilde{V}^\top W_1^\top \Sigma W_1\tilde{V}\) is a diagonal matrix such that \(\Lambda =\text{ diag }(\tilde{\lambda }_1,\dots ,\tilde{\lambda }_N)\) with \(\tilde{\lambda }_1\ge \tilde{\lambda }_2\ge \cdots \ge \tilde{\lambda }_N\). Then, it holds that

$$\begin{aligned} \mathscr {B}= & {} \left| \left| \left( I_N -\tilde{V}^\top W_1^\top \tilde{X}^\top (\tilde{X}W_1 \tilde{V}\tilde{V}^\top W_1^\top \tilde{X}^\top +\lambda ^{(2)} I_n)^{-1} \tilde{X}W_1 \tilde{V} \right) \tilde{V}^\top W_1^\star {\varvec{a^*}}\right| \right| ^2_\Lambda ,\nonumber \\ \mathscr {V}= & {} \left| \left| \tilde{V} ^\top W_1^\top \tilde{X}^\top (\tilde{X}W_1\tilde{V}\tilde{V}^\top W_1^\top \tilde{X}^\top +\lambda ^{(2)} I_n)^{-1}\tilde{{\varvec{\epsilon }}} \right| \right| ^2_\Lambda .\nonumber \end{aligned}$$

Here, \(W_1^\star \) is any matrix satisfying \(W_1W_1^\star =I_d\). Since \(W_1\) has full row rank, such a \(W_1^\star \) exists; indeed, the Moore–Penrose pseudoinverse \(W_1^\dagger \) satisfies \(W_1W_1^\dagger =I_d\).

Remark 1

Since \(W_1\in \mathbb {R}^{d\times N}\) and \(N>d\), \(\tilde{\lambda }_{d+1}=\cdots =\tilde{\lambda }_N=0\).

Once the first layer's parameter is fixed, the predictive error of the two-layer network can be evaluated through the usual analysis of the one-layer network. The following theorem gives upper bounds on the Bias and Variance of the two-layer network through the analysis of the usual ridge regression in the interpolation regime [19]. With this theorem, we can evaluate the effect of feature learning on the predictive error. Since \(\mathbb {E}[\tilde{V}^\top W_1^\top {\varvec{x}}{\varvec{x}}^\top W_1\tilde{V}] = \tilde{V}^\top W_1^\top \Sigma W_1\tilde{V}\), the proof is obtained immediately by applying Theorem 1 of Tsigler and Bartlett [19] to Theorem 1.
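The diagonalization underlying Theorem 1 is easy to verify numerically. In the sketch below (ours), an arbitrary random matrix stands in for the trained first layer \(W_1\); the right singular vectors \(\tilde{V}\) of \(\Sigma ^{\frac{1}{2}}W_1\) indeed diagonalize \(W_1^\top \Sigma W_1\), with the last \(N-d\) diagonal entries equal to zero as stated in Remark 1.

```python
import numpy as np

rng = np.random.default_rng(4)
d, N = 50, 200
lam = np.arange(1, d + 1, dtype=float) ** -0.5
W1 = rng.standard_normal((d, N)) / np.sqrt(N)          # arbitrary stand-in for the trained first layer

# V~ = right singular vectors of Sigma^{1/2} W1 (full SVD, so V~ is an N x N orthogonal matrix).
_, s, Vt = np.linalg.svd(np.diag(np.sqrt(lam)) @ W1, full_matrices=True)
Vtilde = Vt.T

Lam = Vtilde.T @ W1.T @ np.diag(lam) @ W1 @ Vtilde      # should equal diag(s^2, 0, ..., 0)
target = np.diag(np.concatenate([s ** 2, np.zeros(N - d)]))
print(np.allclose(Lam, target, atol=1e-10))             # True
```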

Theorem 2

Suppose \(\lambda ^{(2)} \ge 0\). There exist constants \(c_x\) and \(C_x\) depending only on \(\sigma _x\) such that, for any \(\delta <1-4e^{-n/c_x^2}\), (1) the condition number of \(XW_1\tilde{V}_{k:\infty }\tilde{V}_{k:\infty }^\top W_1^\top X^\top +\lambda ^{(2)}I_n\) is at most L with probability \(1-\delta \), where L is a constant such that

$$\begin{aligned} \frac{1}{c_xL} \le \frac{\lambda ^{(2)}+\sum _{i>k}\tilde{\lambda }_i}{n\tilde{\lambda }_{k+1}}, \nonumber \end{aligned}$$

and (2) for any \(k \le n/c_x,~t \in (1,n/c_x)\) , we have

$$\begin{aligned} \frac{\mathscr {B}}{C_x L^4}\le & {} \Vert \tilde{V}_{0:k}^\top W_1^\star {\varvec{a}}^*\Vert ^2_{\Lambda _{0:k}^{-1}} \left( \frac{\lambda ^{(2)}+\sum _{i>k}\tilde{\lambda }_i}{n}\right) ^2 +\Vert \tilde{V}_{k:\infty }^\top W_1^\star {\varvec{a}}^*\Vert ^2_{\Lambda _{k:\infty }} =:\bar{B}_1,\nonumber \\ \frac{\mathscr {V}}{C_xL^2\sigma _\epsilon ^2 t}\le & {} \frac{k}{n}+\frac{n\sum _{i>k}\tilde{\lambda }_i^2}{(\lambda ^{(2)}+\sum _{i>k}\tilde{\lambda _i})^2} =: \bar{V}_1. \nonumber \end{aligned}$$

with probability \(1-\delta -20e^{-t/c_x}\).

Here, by the note on the random feature model in Sect. 2, the evaluation of the Bias and Variance with \(\eta =0\) reduces to that of a one-layer ridge regression; namely, the following corollary holds.

Corollary 1

Letting \(\eta =0\) under the same conditions as in Theorem 2, there exists \(N_0\in \mathbb {Z}\) such that, for any \(N\ge N_0\), the following inequalities hold:

$$\begin{aligned} \frac{\mathscr {B}}{C_x L^4}\le & {} \Vert {\varvec{a}}^*_{0:k}\Vert ^2_{\Sigma _{0:k}^{-1}} \left( \frac{\lambda ^{(2)}+\sum _{i>k}{\lambda }_i}{n}\right) ^2 +\Vert {\varvec{a}}^*_{k:\infty }\Vert ^2_{\Sigma _{k:\infty }} =:\bar{B}_0,\nonumber \\ \frac{\mathscr {V}}{C_xL^2\sigma _\epsilon ^2 t}\le & {} \frac{k}{n}+\frac{n\sum _{i>k}{\lambda }_i^2}{(\lambda ^{(2)}+\sum _{i>k}{\lambda _i})^2} =: \bar{V}_0. \nonumber \end{aligned}$$

Remark 2

\(\bar{B}_0\) and \(\bar{V}_0\) given in Corollary 1 match the evaluation of Bias and Variance in Theorem 1 of Tsigler and Bartlett [19].

We then present some additional lemmas on Theorem 2.

Lemma 1

The second term of \(\bar{B}_1\) is represented as follows:

$$\begin{aligned} \Vert \tilde{V}_{k:\infty }^\top W_1^\star {\varvec{a}}^*\Vert ^2_{\Lambda _{k:\infty }} = \Vert {\varvec{a}}^*\Vert ^2_{\Sigma ^{\frac{1}{2}}\tilde{U}_{k:\infty }\tilde{U}_{k:\infty }^\top \Sigma ^{\frac{1}{2}}}, \nonumber \end{aligned}$$

where \(\tilde{U}\in \mathbb {R}^{d\times d}\) denotes the matrix of left singular vectors of \(\Sigma ^{\frac{1}{2}}W_1\).

Remark 3

The value of \(\bar{B}_1\) does not depend on the choice of \(W_1^\star \).

3.1 Error analysis of sufficiently small step feature learning

In this subsection, we give a first-order perturbation analysis of the predictive error in the small step size regime \(\eta \ll 1\). We first note that the step size needs to be appropriately reduced as N increases in order to perform the perturbation analysis.

Now, for given \(\tilde{{\varvec{a}}}\), \(W_1 W_1^\top =I_d+\eta ^2\tilde{{\varvec{a}}}\tilde{{\varvec{a}}}^\top +\eta (W_0 {\varvec{a}}\tilde{{\varvec{a}}}^\top +\tilde{{\varvec{a}}}(W_0 {\varvec{a}})^\top )+(W_0W_0^\top -I_d)\) holds from (3). As \(W_0W_0^\top =I_d+\mathcal {O}_\mathbb {P}(\sqrt{d/N})\) and \(\Vert W_0 {\varvec{a}}\Vert =\mathcal {O}_\mathbb {P}(\sqrt{d/N})\), when \(\eta = N^\gamma \,(\gamma \in (-1/2,0))\) and N is sufficiently large, \(W_1 W_1^\top \) is represented as \(W_1 W_1^\top =I_d+\eta ^2\tilde{{\varvec{a}}}\tilde{{\varvec{a}}}^\top +o_\mathbb {P}(\eta ^2)\). The singular values and singular vectors of \(\Sigma ^{\frac{1}{2}}W_1\) can be evaluated from the expansion. Under this assumption, we give the predictive error after feature learning below. Let \(\eta ^2=\varepsilon \).
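The orders appearing in this expansion can be checked numerically. The following sketch (ours) confirms that both \(\Vert W_0W_0^\top -I_d\Vert _{\textrm{op}}\) and \(\Vert W_0 {\varvec{a}}\Vert \) scale like \(\sqrt{d/N}\) under the initialization of Sect. 2.2, so that the \(\eta ^2\tilde{{\varvec{a}}}\tilde{{\varvec{a}}}^\top \) term dominates for \(\eta =N^\gamma \) with \(\gamma \) in the stated range.

```python
import numpy as np

rng = np.random.default_rng(5)
d = 200
for N in (1_000, 10_000, 100_000):
    W0 = rng.standard_normal((d, N)) / np.sqrt(N)   # W_ij ~ N(0, 1/N)
    a0 = rng.standard_normal(N) / np.sqrt(N)        # a_j  ~ N(0, 1/N)
    err_id = np.linalg.norm(W0 @ W0.T - np.eye(d), ord=2)   # || W0 W0^T - I_d ||_op
    cross = np.linalg.norm(W0 @ a0)                         # || W0 a ||
    print(N, round(err_id, 3), round(cross, 3), "sqrt(d/N) =", round(np.sqrt(d / N), 3))
```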

Theorem 3

Let \(\bar{B}_1\) and \(\bar{V}_1\) be those given in Theorem 2, and let k and \(\lambda ^{(2)}\) satisfy the same assumptions as in Theorem 2. Furthermore, let \(W_1 W_1^\top =I_d+\varepsilon \tilde{{\varvec{a}}}\tilde{{\varvec{a}}}^\top +o_\mathbb {P}(\varepsilon )\) and assume that the eigenvalues of \(\Sigma \) are not degenerate. Then, \(\bar{B}_1\) and \(\bar{V}_1\) are represented as follows:

$$\begin{aligned} \bar{B}_1= & {} \Vert V_{0:k}^\top W_1^\star {\varvec{a}}^*\Vert ^2_{\Lambda _{0:k}^{-1}} \left( \frac{\lambda ^{(2)}+\sum _{i>k}\tilde{\lambda }_i}{n}\right) ^2 +\Vert V_{k:\infty }^\top W_1^\star {\varvec{a}}^*\Vert ^2_{\Lambda _{k:\infty }} \nonumber \\= & {} \Vert {\varvec{a}}^*_{0:k}\Vert ^2_{\Sigma _{0:k}^{-1}}\left( \frac{\lambda ^{(2)} +\sum _{i>k}{\lambda }_i}{n}\right) ^2 \nonumber \\{} & {} +2\varepsilon \Vert {\varvec{a}}^*_{0:k}\Vert ^2_{\Sigma _{0:k}^{-1}}\frac{\Vert \tilde{{\varvec{a}}}_{k:\infty }\Vert ^2_{\Sigma _{k:\infty }}}{n} \nonumber \left( \frac{\lambda ^{(2)}+\sum _{i>k}{\lambda }_i}{n}\right) \nonumber \\{} & {} - 2\varepsilon \left( \sum _{i=1}^k \tilde{{\varvec{a}}}_i^2 {{\varvec{a}}^*_i}^2\lambda _i^{-1} \right) \left( \frac{\lambda ^{(2)} +\sum _{i>k}{\lambda }_i}{n}\right) ^2\nonumber \\{} & {} -2\varepsilon \left( \sum _{j=1}^d\sum _{i=1}^{\min (k,j-1)}\frac{\lambda _i^{-\frac{3}{2}}\lambda _j^{\frac{1}{2}}}{\lambda _i-\lambda _j}\tilde{{\varvec{a}}_i}\tilde{{\varvec{a}}}_j{\varvec{a}}^*_i{\varvec{a}}^*_j - \sum _{j=1}^{k-1}\sum _{i=j+1}^{k} \frac{\lambda _i^{-\frac{3}{2}}\lambda _j^{\frac{1}{2}}}{\lambda _j-\lambda _i}\tilde{{\varvec{a}}_i}\tilde{{\varvec{a}}}_j{\varvec{a}}^*_i{\varvec{a}}^*_j \right) \nonumber \\{} & {} +\Vert {\varvec{a}}^*_{k:\infty }\Vert _{\Sigma _{k:\infty }} \nonumber \\{} & {} +2\varepsilon \sum _{i=1}^k\sum _{j=k+1}^d\frac{\lambda _i^\frac{1}{2}\lambda _j^\frac{1}{2}}{\lambda _i-\lambda _j}\tilde{{\varvec{a}}}_i\tilde{{\varvec{a}}}_j{\varvec{a}}^*_i{\varvec{a}}^*_j+o_\mathbb {P}(\varepsilon ), \nonumber \\ \bar{V}_1= & {} \frac{k}{n} + \frac{n\sum _{i>k}\tilde{\lambda _i}^2}{(\lambda ^{(2)}+\sum _{i>k}\tilde{\lambda }_i)^2} \nonumber \\= & {} \frac{k}{n}+\frac{n\sum _{i>k}\lambda _i^2}{(\lambda ^{(2)}+\sum _{i>k}\lambda _i)^2}\left( 1+2\varepsilon \frac{\sum _{i>k}\lambda _i^2\tilde{{\varvec{a}}}_i^2}{\sum _{i>k}\lambda _i^2} -2\varepsilon \frac{\sum _{i>k}\lambda _i \tilde{{\varvec{a}}}_i^2}{\lambda ^{(2)}+\sum _{i>k}\lambda _i}\right) +o_\mathbb {P}(\varepsilon ). \nonumber \end{aligned}$$

Here, \({\varvec{a}}^2_i:=({\varvec{a}}_i)^2\) for any vector \({\varvec{a}}\).

Denote the terms on the right-hand side of the bias \(\bar{B}_1\) in Theorem 3 by \(b_1\) to \(b_6\), respectively; that is,

$$\begin{aligned} b_1= & {} \Vert {\varvec{a}}^*_{0:k}\Vert ^2_{\Sigma _{0:k}^{-1}}\left( \frac{\lambda ^{(2)} +\sum _{i>k}{\lambda }_i}{n}\right) ^2,~ b_2=2\varepsilon \Vert {\varvec{a}}^*_{0:k}\Vert ^2_{\Sigma _{0:k}^{-1}}\frac{\Vert \tilde{{\varvec{a}}}_{k:\infty }\Vert ^2_{\Sigma _{k:\infty }}}{n} \nonumber \left( \frac{\lambda ^{(2)}+\sum _{i>k}{\lambda }_i}{n}\right) ,~\nonumber \\ b_3= & {} -2\varepsilon \left( \sum _{i=1}^k \tilde{{\varvec{a}}}_i^2 {{\varvec{a}}^*_i}^2\lambda _i^{-1} \right) \left( \frac{\lambda ^{(2)} +\sum _{i>k}{\lambda }_i}{n}\right) ^2,\nonumber \\ ~b_4= & {} -2\varepsilon \left( \sum _{j=1}^d\sum _{i=1}^{\min (k,j-1)}\frac{\lambda _i^{-\frac{3}{2}}\lambda _j^{\frac{1}{2}}}{\lambda _i-\lambda _j}\tilde{{\varvec{a}}_i}\tilde{{\varvec{a}}}_j{\varvec{a}}^*_i{\varvec{a}}^*_j - \sum _{j=1}^{k-1}\sum _{i=j+1}^{k} \frac{\lambda _i^{-\frac{3}{2}}\lambda _j^{\frac{1}{2}}}{\lambda _j-\lambda _i}\tilde{{\varvec{a}}_i}\tilde{{\varvec{a}}}_j{\varvec{a}}^*_i{\varvec{a}}^*_j \right) ,~\nonumber \\ b_5= & {} \Vert {\varvec{a}}^*_{k:\infty }\Vert _{\Sigma _{k:\infty }},~ b_6=2\varepsilon \sum _{i=1}^k\sum _{j=k+1}^d\frac{\lambda _i^\frac{1}{2}\lambda _j^\frac{1}{2}}{\lambda _i-\lambda _j}\tilde{{\varvec{a}}}_i\tilde{{\varvec{a}}}_j{\varvec{a}}^*_i{\varvec{a}}^*_j. \nonumber \end{aligned}$$

Each term has the following meaning. First, \(b_1\) and \(b_5\) are the same terms as in \(\bar{B}_0\). The terms \(b_2\) and \(b_3\) are the main contributions that increase or decrease \(\bar{B}_1\) relative to \(\bar{B}_0\), because \(b_4\) and \(b_6\) vanish in the sparse signal setting considered below. Next, we give a corollary to analyze the effect of \(b_4\) and \(b_6\) in detail.

Corollary 2

Let k satisfy the same assumptions as in Theorem 2 and let \(\tilde{{\varvec{a}}}=P^{-1}X^\top {\varvec{y}}/n\) with a diagonal matrix \(P=\text{ diag }(p_1,\dots ,p_d)\). Then, the following equations hold:

$$\begin{aligned} \mathbb {E}_{X,\epsilon }[b_4]= & {} -4\varepsilon \left( \sum _{j=1}^d\sum _{i=1}^{\min (k,j-1)}\frac{\lambda _i^{-\frac{3}{2}}\lambda _j^\frac{1}{2}}{\lambda _i-\lambda _j}\frac{\lambda _i}{p_i}{{\varvec{a}}^*_i}^2\frac{\lambda _j}{p_j}{{\varvec{a}}^*_j}^2-\sum _{j=1}^{k-1}\sum _{i=j+1}^k\frac{\lambda _i^{-\frac{3}{2}}\lambda _j^\frac{1}{2}}{\lambda _j-\lambda _i}\frac{\lambda _i}{p_i}{{\varvec{a}}^*_i}^2\frac{\lambda _j}{p_j}{{\varvec{a}}^*_j}^2\right) ,\nonumber \\ \mathbb {E}_{X,\epsilon }[b_6]= & {} 4\varepsilon \sum _{i=1}^k\sum _{j=k+1}^d\frac{\lambda _i^\frac{1}{2}\lambda _j^\frac{1}{2}}{\lambda _i-\lambda _j}\frac{\lambda _i}{p_i}{{\varvec{a}}^*_i}^2\frac{\lambda _j}{p_j}{{\varvec{a}}^*_j}^2.\nonumber \end{aligned}$$

Remember that \(P = \Sigma \) and \(P= I_d\) correspond to the NGD and GD update respectively.

From Corollary 2, we have the following observations. First, \(b_4\) represents the interaction among the first k components of \({\varvec{a}}^*\). In the summand, the terms with smaller \(\lambda _i - \lambda _j\) have a greater impact on the overall bias because \(\lambda _i - \lambda _j\) appears in the denominator of each term. Since the first term is always negative, it makes \(\bar{B}_1\) smaller on average, while the second term makes \(\bar{B}_1\) larger on average. Next, \(b_6\) represents the interaction between the first k components of \({\varvec{a}}^*\) and the remaining components. Because \(\mathbb {E}_{X,\epsilon }[b_6]\) is always positive, \(b_6\) makes \(\bar{B}_1\) larger on average. However, its effect is relatively small because \(\lambda _i - \lambda _j\) in the denominator tends to be large when the two indices come from different blocks, i.e., i is among the first k components and j is among the rest.

Sparse signal In practice, signals are often sparse under high-dimensional inputs; namely, most components of \({\varvec{a}}^*\) are zero. In the following, we therefore derive a more detailed evaluation of Theorem 3 in the case where \({\varvec{a}}^*\) is completely sparse, that is, \({\varvec{a}}^*_{k^*}=1/\sqrt{\lambda _{k^*}},~{\varvec{a}}^*_i=0~(\forall i\ne k^*)\) for some \(k^*\in [d]\). In this case, since \(b_4=b_6=0\), we obtain the following corollaries, where the change in the bias is due only to \(b_2\) and \(b_3\).

Corollary 3

Let \(\bar{B}_1\) and \(\bar{V}_1\) be given in Theorem 3 and let k and \(\lambda ^{(2)}\) satisfy the same assumptions as in Theorem 2. If \({\varvec{a}}^*_i = \lambda _i^{-1/2} ~(i=k^*),~{\varvec{a}}^*_i = 0 ~(i\ne k^*)\) for \(k^*\le k\), and \(\tilde{{\varvec{a}}} = (\Sigma +\lambda ^{(1)}I_d)^{-1}X^\top {\varvec{y}}/n\), the following equations hold:

$$\begin{aligned} \mathbb {E}_{X,\epsilon }[\bar{B}_1]{} & {} = \bar{B}_0\left\{ 1-2\varepsilon \left[ \frac{1}{\lambda _{k^*}}-\frac{1+\sigma _\epsilon ^2}{n}\frac{\sum _{i>k}\left( \frac{\lambda _i}{\lambda _i+\lambda ^{(1)}}\right) ^2}{\lambda ^{(2)}+\sum _{i>k}\lambda _i} \right] \right\} +o_\mathbb {P}(\varepsilon ),\\ \mathbb {E}_{X,\epsilon }[\bar{V}_1]{} & {} = \bar{V}_0 -2\varepsilon \frac{n\sum _{i>k}\lambda _i^2}{(\lambda ^{(2)}+\sum _{i>k}\lambda _i)^2}(1+\sigma ^2_\epsilon )\\{} & {} \quad \ \times \left( \frac{1}{n}\frac{\sum _{i>k}\left( \frac{\lambda _i}{\lambda _i+\lambda ^{(1)}}\right) ^2}{\lambda ^{(2)}+\sum _{i>k}\lambda _i} - \frac{1}{n}\frac{\sum _{i>k}\lambda _i\left( \frac{\lambda _i}{\lambda _i+\lambda ^{(1)}}\right) ^2}{\sum _{i>k}\lambda _i^2}\right) +o_\mathbb {P}(\varepsilon ). \end{aligned}$$

In particular, if \(\lambda ^{(1)}=0\), we have the following evaluations:

$$\begin{aligned} \mathbb {E}_{X,\epsilon }[\bar{B}_1]{} & {} = \bar{B}_0\left[ 1-2\varepsilon \left( \frac{1}{\lambda _{k^*}}-\frac{d-k}{n}\frac{1+\sigma _\epsilon ^2}{\lambda ^{(2)}+\sum _{i>k}\lambda _i}\right) \right] +o_\mathbb {P}(\varepsilon ), \\ \mathbb {E}_{X,\epsilon }[\bar{V}_1]{} & {} = \bar{V}_0 -2\varepsilon \frac{n\sum _{i>k}\lambda _i^2}{(\lambda ^{(2)}+\sum _{i>k}\lambda _i)^2}(1+\sigma _\epsilon ^2)\left( \frac{d-k}{n}\frac{1}{\lambda ^{(2)}+\sum _{i>k}\lambda _i} - \frac{1}{n}\frac{\sum _{i>k}\lambda _i}{\sum _{i>k}\lambda _i^2}\right) \\{} & {} \quad \ +o_\mathbb {P}(\varepsilon ). \end{aligned}$$

Corollary 3 shows how NGD affects the bias through feature learning. On the other hand, the following corollary gives the effect of GD.

Corollary 4

Let \(\bar{B}_1\) and \(\bar{V}_1\) be given in Theorem 3 and let k and \(\lambda ^{(2)}\) satisfy the same assumptions as in Theorem 2. If \({\varvec{a}}^*_i = \lambda _i^{-1/2} ~(i=k^*),~{\varvec{a}}^*_i = 0 ~(i\ne k^*)\) for \(k^*\le k\), and \(\tilde{{\varvec{a}}} = X^\top {\varvec{y}}/n\), the following equations hold:

$$\begin{aligned} \mathbb {E}_{X,\epsilon }[\bar{B}_1]= & {} \bar{B}_0\left[ 1-2\varepsilon \left( \lambda _{k^*}-\frac{1+\sigma _\epsilon ^2}{n}\frac{\sum _{i>k}\lambda _i^2}{\lambda ^{(2)}+\sum _{i>k}\lambda _i}\right) \right] +o_\mathbb {P}(\varepsilon ),\nonumber \\ \mathbb {E}_{X,\epsilon }[\bar{V}_1]= & {} \bar{V}_0 -2\varepsilon \frac{n\sum _{i>k}\lambda _i^2}{(\lambda ^{(2)}+\sum _{i>k}\lambda _i)^2}(1+\sigma _\epsilon ^2)\left( \frac{1}{n}\frac{\sum _{i>k}\lambda _i^2}{\lambda ^{(2)}+\sum _{i>k}\lambda _i} - \frac{1}{n}\frac{\sum _{i>k}\lambda _i^3}{\sum _{i>k}\lambda _i^2}\right) \nonumber \\{} & {} +o_\mathbb {P}(\varepsilon ). \nonumber \end{aligned}$$

In the following, we discuss in more detail the effect of feature learning with NGD and with GD.

For \(\tilde{{\varvec{a}}}=\Sigma ^{-1}X^\top {\varvec{y}}/n\), while \(\tilde{{\varvec{a}}}\) is an unbiased estimator of \({\varvec{a}}^*\) (\(\mathbb {E}_{X,{\varvec{\epsilon }}}[\tilde{{\varvec{a}}}]={\varvec{a}}^*\)), its variance is far from zero. Indeed, \(\mathbb {E}_{X,{\varvec{\epsilon }}}[\Vert \tilde{{\varvec{a}}}-{\varvec{a}}^*\Vert _\Sigma ^2]\) is approximately \(\frac{d(1+\sigma ^2_\epsilon )}{n}\) by (11). Therefore, when d is larger than n, the estimation accuracy of the first layer is quite poor. However, Corollary 3 suggests that feature learning can improve the predictive error even in such a case. In fact, the following corollary characterizes a situation in which the bias is reduced.

Corollary 5

Let k and \(\lambda ^{(2)}\) satisfy the same assumptions as in Theorem 2 and let \(\lambda _i=i^{-(1+\alpha )}~(\alpha >-1,~\alpha \ne 0)\). Then, the following holds:

$$\begin{aligned} \frac{1}{\lambda _{k^*}}-\frac{d-k}{n}\frac{1+\sigma ^2_\epsilon }{\lambda ^{(2)}+\sum _{i>k}\lambda _i} > {k^*}^{1+\alpha }-\frac{d}{n}\frac{1+\sigma ^2_\epsilon }{\lambda ^{(2)}+\frac{1}{\alpha }\left[ (k+1)^{-\alpha }-d^{-\alpha }\right] }.\nonumber \end{aligned}$$

Especially, letting \(k=o(d),~\lambda ^{(2)}=0\), and \(-1<\alpha <0\), we have that

$$\begin{aligned} \frac{1}{\lambda _{k^*}}-\frac{d-k}{n}\frac{1+\sigma ^2_\epsilon }{\lambda ^{(2)}+\sum _{i>k}\lambda _i} >{k^*}^{1+\alpha }+\frac{\alpha (1+\sigma ^2_\epsilon )}{n}d^{1+\alpha }. \nonumber \end{aligned}$$

Hence, if \({k^*}^{1+\alpha }>\frac{|\alpha |(1+\sigma ^2_\epsilon )}{n}d^{1+\alpha }\), then it holds that

$$\begin{aligned} \frac{1}{\lambda _{k^*}}-\frac{d-k}{n}\frac{1+\sigma ^2_\epsilon }{\lambda ^{(2)}+\sum _{i>k}\lambda _i} >0,\nonumber \end{aligned}$$

and moreover \(\bar{B}_1<\bar{B}_0\) holds.

Corollary 5 tells us that the threshold of \(k^*\) above which feature learning yields better generalization is \({k^*}^{1+\alpha }=\frac{|\alpha |(1+\sigma ^2_\epsilon )}{n}d^{1+\alpha }\) when \(\alpha <0\). In particular, if \(d=o(n^\frac{1}{1+\alpha })\), the improvement of the bias is of order \({k^*}^{1+\alpha }\). On the other hand, if d is larger than n and the eigenvalues decay fast, feature learning worsens the bias, though one-layer linear regression may generalize even in the case where \(d=\infty \) and \(n<\infty \), as shown by Bartlett et al. [4]. In this regime, it is shown that ridge regression exhibits good predictive performance.
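For a rough numerical illustration with the parameter values used in the experiments of Sect. 4 (\(n=100,~d=200,~\alpha =-0.5,~\sigma _\epsilon ^2=0.01\)), the threshold condition reads \({k^*}^{1/2}>0.5\cdot 1.01\cdot \sqrt{200}/100\approx 0.07\), which every \(k^*\ge 1\) satisfies; moreover, for \(k^*=100\) the lower bound in Corollary 5 is approximately \(\sqrt{100}-0.07\approx 9.9>0\), so the bias-reducing term dominates and \(\bar{B}_1<\bar{B}_0\).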

For \(\tilde{{\varvec{a}}}=X^\top {\varvec{y}}/n\), the decrease in the bias satisfies \(\lambda _{k^*}-\frac{1+\sigma _\epsilon ^2}{n}\frac{\sum _{i>k}\lambda _i^2}{\lambda ^{(2)}+\sum _{i>k}\lambda _i}<\lambda _{k^*}={k^*}^{-(1+\alpha )}\) in the same setting as Corollary 5. Therefore, as \(k^*\rightarrow \infty \), the improvement of the bias converges to zero.

Therefore, when \(k^*\) is large enough to exceed the threshold and the eigenvalue decay is slow, feature learning with NGD improves the bias while feature learning with GD does not; hence NGD is superior to GD.

3.2 Error analysis of sufficiently large step feature learning

In this subsection, we give an evaluation of the predictive error when the step size \(\eta \) is sufficiently large.

Ba et al. [3] shows that, when \(\eta =\mathcal {O}(N)\), one-step gradient feature learning can outperform the kernel method as \(n,d,N\rightarrow \infty \). It is also reported that, if \(\eta =\mathcal {O}(1)\), feature learning does not improve the predictive error, and \(\eta =\omega ({\sqrt{N}})\) is still not enough.

In this study, we analyze the predictive error in the case where N is sufficiently large while n and d are finite. Here, we assume that the step size \(\eta \) is of order \(N^\gamma ~(\gamma \in (0,\frac{1}{2}))\) for the following reason. As we have seen in Sect. 3.1, \(W_1 W_1^\top =I_d+\eta ^2\tilde{{\varvec{a}}}\tilde{{\varvec{a}}}^\top +\eta (W_0 {\varvec{a}}\tilde{{\varvec{a}}}^\top +\tilde{{\varvec{a}}}(W_0 {\varvec{a}})^\top )\) and \(\eta (W_0 {\varvec{a}}\tilde{{\varvec{a}}}^\top +\tilde{{\varvec{a}}}(W_0 {\varvec{a}})^\top )=\mathcal {O}_\mathbb {P}(\eta \sqrt{\frac{d}{N}})\). Therefore, when \(\eta =N^\gamma ~(\gamma \in (0,\frac{1}{2}))\), \(W_1 W_1^\top =I_d+\eta ^2\tilde{{\varvec{a}}}\tilde{{\varvec{a}}}^\top +\mathcal {O}_\mathbb {P}(\sqrt{d}\,\eta ^{1-\frac{1}{2\gamma }}) \). Then, for large \(\eta \), \(W_1 W_1^\top =I_d+\eta ^2\tilde{{\varvec{a}}}\tilde{{\varvec{a}}}^\top +o_\mathbb {P}(1) \) and the following theorem holds.

We let \(\tilde{U}=\lim _{\eta \rightarrow \infty }\tilde{U}(\eta )\), where \(\tilde{U}(\eta )\) is the matrix of eigenvectors of \(\Sigma ^\frac{1}{2}W_1W_1^\top \Sigma ^\frac{1}{2}=\Sigma +\eta ^2\Sigma ^\frac{1}{2}\tilde{{\varvec{a}}}\tilde{{\varvec{a}}}^\top \Sigma ^\frac{1}{2}\). One can easily check that the limit exists and that \(\tilde{U}\) is an orthogonal matrix. Note that the first column of \(\tilde{U}\) is parallel to \(\Sigma ^{\frac{1}{2}} \tilde{{\varvec{a}}}\) and the other columns are orthogonal to \(\Sigma ^{\frac{1}{2}} \tilde{{\varvec{a}}}\).

Theorem 4

Let \(\bar{B}_1\) and \(\bar{V}_1\) be given in Theorem 2 and let k and \(\lambda ^{(2)}\) satisfy the same assumptions as in Theorem 2. Then, letting \({\varvec{b}}=\tilde{{\varvec{a}}}-(\tilde{{\varvec{a}}}^\top \Sigma {\varvec{a}}^*){\varvec{a}}^*\), if \( W_1 W_1^\top =I_d+\eta ^2\tilde{{\varvec{a}}}\tilde{{\varvec{a}}}^\top +o_\mathbb {P}(1)\), the following holds.

$$\begin{aligned} \bar{B}_1\le & {} \left\| \tilde{U}_{1:k}^\top \Sigma ^\frac{1}{2}\frac{{\varvec{b}}}{\tilde{{\varvec{a}}}^\top \Sigma {\varvec{a}}^*}\right\| ^2_{\Sigma _{1:k}^{-2}}\left( \frac{\lambda ^{(2)}+\lambda _1+\sum _{i>k}\lambda _i}{n}\right) ^2 +\left\| \tilde{U}_{k:\infty }^\top \Sigma ^\frac{1}{2}\frac{{\varvec{b}}}{\tilde{{\varvec{a}}}^\top \Sigma {\varvec{a}}^*}\right\| ^2+o_\mathbb {P}(1).\nonumber \end{aligned}$$

As for the variance, since \(\sum _{i>k}\lambda _i\le \sum _{i>k}\tilde{\lambda }_i\le \sum _{i>k}\lambda _i+\lambda _1\) in (13), it holds that

$$\begin{aligned} \bar{V}_1 \le \frac{k}{n}+\frac{n\sum _{i>k}\tilde{\lambda }_i^2}{\lambda ^{(2)}+\sum _{i>k}\lambda _i+\lambda _1}+o_\mathbb {P}(1) \end{aligned}$$

from Theorem 2. It is expected that \(\sum _{i>k}\tilde{\lambda }_i^2\) is also approximately equal to \(\sum _{i>k}\lambda _i^2\).

In Theorem 4, if \(\tilde{{\varvec{a}}}\) approximates \({\varvec{a}}^*\) well, then \(\tilde{{\varvec{a}}}^\top \Sigma {\varvec{a}}^*\) is approximately 1 and \(\Vert {\varvec{b}}\Vert ^2_\Sigma \) approximately matches \(\Vert {\varvec{a}}^*-\tilde{{\varvec{a}}}\Vert ^2_\Sigma \). Therefore, Theorem 4 can be regarded as a formula evaluating the approximation error of the first-layer estimator \(\tilde{{\varvec{a}}}\) to \({\varvec{a}}^*\), decomposed by \(\tilde{U}\) into the first k components and the remaining ones. However, it is difficult to explicitly characterize the distribution of \(\tilde{U}\). Hence, in the following, we consider two concrete situations in which Theorem 4 can be evaluated explicitly.

(1) The first one is the case where the distribution of \(\tilde{U}\) is the worst. Since \(\tilde{U}\) is an orthogonal matrix, it holds that \(\left\| \tilde{U}_{1:k}^\top \Sigma ^\frac{1}{2}\frac{{\varvec{b}}}{\tilde{{\varvec{a}}}^\top \Sigma {\varvec{a}}^*}\right\| ^2_{\Sigma _{1:k}^{-2}}\le \frac{\lambda _k^{-2}\Vert {\varvec{b}}\Vert ^2_\Sigma }{\tilde{{\varvec{a}}}^\top \Sigma {\varvec{a}}^*}\). Even in this worst case setting, that is, when \(\left\| \tilde{U}_{1:k}^\top \Sigma ^\frac{1}{2}\frac{{\varvec{b}}}{\tilde{{\varvec{a}}}^\top \Sigma {\varvec{a}}^*}\right\| ^2_{\Sigma _{1:k}^{-2}}= \frac{\lambda _k^{-2}\Vert {\varvec{b}}\Vert ^2_\Sigma }{\tilde{{\varvec{a}}}^\top \Sigma {\varvec{a}}^*}\), the following evaluation holds.

Corollary 6

Let \(\mathscr {B}\) be given in (7). There exists a large constant \(c_x\) that only depends on \(\sigma _x\) such that

  1.

    If there exists a constant \(C_x\) that only depends on \(\sigma _x\) and \(n\lambda _{k+1}\ge C_x\sum _{i>k}\lambda _i\) holds for some \(k<n/c_x\), then by letting \(\lambda ^{(2)}=n\tilde{\lambda }_{k+1}\) in Theorem 4, we have that for any \(t\in (c_x,n/c_x)\)

    $$\begin{aligned} \mathscr {B} \le D_x\frac{\Vert {\varvec{b}}\Vert ^2_\Sigma }{\tilde{{\varvec{a}}}^\top \Sigma {\varvec{a}}^*} \nonumber \end{aligned}$$

    with probability at least \(1-26e^{-t/c_x}\). Here, \(D_x\) is a constant that only depends on \(\sigma _x\).

  2.

    If \(\sum _{i>k}\lambda _i\ge c_xn\lambda _{k+1}\) holds for some \(k<n/c_x\), then by letting \(\lambda ^{(2)}=-\sum _{i>k}\lambda _i+\xi \left( n\lambda _1+\sqrt{n\sum _{i>k}\lambda _i^2}\right) \) for some constant \(\xi >c_x\), we have that for any \(t\in (c_x,n/c_x)\)

    $$\begin{aligned} \mathscr {B} \le D_x\left( 1+\frac{\sum _{i>k}\lambda _i^2}{\lambda _k^2}\right) \frac{\Vert {\varvec{b}}\Vert ^2_\Sigma }{\tilde{{\varvec{a}}}^\top \Sigma {\varvec{a}}^*}\nonumber \end{aligned}$$

    with probability at least \(1-20e^{-t/c_x}\). Especially, if \(\lambda _i=i^{-(1+\alpha )}~(-0.5<\alpha \le 0)\), we have \(\mathscr {B} \le D_x\frac{\Vert {\varvec{b}}\Vert ^2_\Sigma }{\tilde{{\varvec{a}}}^\top \Sigma {\varvec{a}}^*}\).

Corollary 6 shows that \(\bar{B}_1\) is bounded by a constant multiple of the estimation error of the first layer even in the case where \(\left\| \tilde{U}_{1:k}^\top \Sigma ^\frac{1}{2}\frac{{\varvec{b}}}{\tilde{{\varvec{a}}}^\top \Sigma {\varvec{a}}^*}\right\| ^2_{\Sigma _{1:k}^{-2}}= \frac{\lambda _k^{-2}\Vert {\varvec{b}}\Vert ^2_\Sigma }{\tilde{{\varvec{a}}}^\top \Sigma {\varvec{a}}^*}\). In particular, for the learning model with \(P_3(\lambda ^{(1)})=\left( \frac{1}{n}(X^\top X+\lambda ^{(1)}I_d)\right) ^{-1}\) in the first-layer update, the predictive error of the whole two-layer estimator is at most a constant multiple of that of the one-layer ridge regression model. Here, \(\left\| \tilde{U}_{1:k}^\top \Sigma ^\frac{1}{2}\frac{{\varvec{b}}}{\tilde{{\varvec{a}}}^\top \Sigma {\varvec{a}}^*}\right\| ^2_{\Sigma _{1:k}^{-2}}= \frac{\lambda _k^{-2}\Vert {\varvec{b}}\Vert ^2_\Sigma }{\tilde{{\varvec{a}}}^\top \Sigma {\varvec{a}}^*}\) holds when \(\tilde{U}_{1:k}^\top \Sigma ^\frac{1}{2}{\varvec{b}}\propto {\varvec{e}}_k\). However, since \(\tilde{U}\) consists of the left singular vectors of a rank-one perturbation of \(\Sigma ^\frac{1}{2}W_0\), whose left singular vectors are given by \(I_d\), this equality does not hold, and feature learning decreases the bias experimentally (Fig. 6).

(2) The second one is the case where the true signal is sparse and the estimation error of the first layer is sufficiently small.

Corollary 7

Let k satisfy the same assumption as Theorem 2 and \(\tilde{U}\) be given in Theorem 4. \(\{{\varvec{e}}_1,\dots ,{\varvec{e}}_d\}\) denotes the standard basis of \(\mathbb {R}^d\) and we assume that the eigenvalues of \(\Sigma \) are not degenerate. In Theorem 4, if \({\varvec{a}}^*=\lambda _{k^*}^{-\frac{1}{2}}{\varvec{e}}_{k^*}~(k^*\le k)\) and \(\frac{\Vert {\varvec{b}}\Vert ^2_\Sigma }{\tilde{{\varvec{a}}}^\top \Sigma {\varvec{a}}^*}\) is sufficiently small, we have that for \(\tilde{U}_{1:\infty }=(\tilde{{\varvec{u}}}_2,\dots ,\tilde{{\varvec{u}}}_d)\)

$$\begin{aligned} \tilde{{\varvec{u}}}_i = {\varvec{e}}_{i-1}+\mathcal {O}\left( \frac{\Vert {\varvec{b}}\Vert ^2_\Sigma }{\tilde{{\varvec{a}}}^\top \Sigma {\varvec{a}}^*}\right) \quad (2\le i\le k^*),\nonumber \\ \tilde{{\varvec{u}}}_i = {\varvec{e}}_{i}+\mathcal {O}\left( \frac{\Vert {\varvec{b}}\Vert ^2_\Sigma }{\tilde{{\varvec{a}}}^\top \Sigma {\varvec{a}}^*}\right) \quad (k^*<i\le d).\nonumber \end{aligned}$$

In this case, it holds that

$$\begin{aligned} \bar{B}_1\le & {} \frac{\Vert {\varvec{b}}_{1:k}\Vert ^2_{\Sigma ^{-1}_{1:k}}}{\tilde{{\varvec{a}}}^\top \Sigma {\varvec{a}}^*}\left( \frac{\lambda ^{(2)}+\lambda _1+\sum _{i>k}\lambda _i}{n}\right) ^2 +\frac{\Vert {\varvec{b}}_{k:\infty }\Vert ^2_{\Sigma _{k:\infty }}}{\tilde{{\varvec{a}}}^\top \Sigma {\varvec{a}}^*}+o\left( \frac{\Vert {\varvec{b}}\Vert ^2_\Sigma }{\tilde{{\varvec{a}}}^\top \Sigma {\varvec{a}}^*}\right) +o_\mathbb {P}(1). \end{aligned}$$

4 Numerical experiments

In this section, we present numerical experiments and discuss the validity of the theoretical results provided in Sect. 3.

4.1 Experimental setting

In this subsection, we describe the overall experimental settings used in Sect. 4. We assume that \(\Sigma =\text{ diag }(\lambda _1,\dots ,\lambda _d)\) satisfies \(\lambda _i=i^{-(1+\alpha )}~(\alpha >-1)\) (recall that \(\Sigma \) is the covariance matrix of the input data \({\varvec{x}}\)). Namely, we suppose that the eigenvalues decay polynomially and \(\alpha \) determines the decay rate. Note that \(\lim _{d\rightarrow \infty }\sum _{i=1}^d\lambda _i<\infty \) when \(\alpha >0\), and \(\lim _{d\rightarrow \infty }\sum _{i=1}^d\lambda _i=\infty \) when \(-1<\alpha \le 0\). Furthermore, we assume that the input data \({\varvec{x}}\) and the noise independently follow Gaussian distributions: \({\varvec{x}} \overset{\text {i.i.d.}}{\sim } \mathcal {N}(0,\Sigma ),~\epsilon \overset{\text {i.i.d.}}{\sim } \mathcal {N}(0,\sigma _\epsilon ^2)\).

In Sect. 3, for ease of analysis, we constructed the theory in a setting where the data are split into the sampled data \((X,{\varvec{y}})\) and the resampled data \((\tilde{X},\tilde{{\varvec{y}}})\). In this section, however, the experiments are conducted in a setting where the data are not split, which is closer to the practical setting. More precisely, we compute \(W_1\) from the data \((X,{\varvec{y}})\) according to (2) and then compute \(\hat{{\varvec{a}}}_{\textrm{FL}},~\hat{{\varvec{a}}}_{\textrm{RF}},\) and \(\hat{{\varvec{a}}}_{\textrm{LR}}\) from the same data in the following way:

$$\begin{aligned} \hat{{\varvec{a}}}_{\textrm{FL}}= & {} W_1 W_1^\top X^\top (XW_1 W_1^\top X^\top +\lambda ^{(2)} I_n)^{-1}{\varvec{y}}, \nonumber \\ \hat{{\varvec{a}}}_{\textrm{RF}}= & {} W_0 W_0^\top X^\top (XW_0 W_0^\top X^\top +\lambda ^{(2)} I_n)^{-1}{\varvec{y}}, \nonumber \\ \hat{{\varvec{a}}}_{\textrm{LR}}= & {} X^\top (XX^\top +\lambda ^{(2)} I_n)^{-1}{\varvec{y}}. \nonumber \end{aligned}$$

Here, \(\hat{{\varvec{a}}}_{\textrm{FL}},~\hat{{\varvec{a}}}_{\textrm{RF}}\) and \(\hat{{\varvec{a}}}_{\textrm{LR}}\) represent the ridge estimators corresponding to the one-step feature learning method, the random feature model and the vanilla linear regression, respectively. In Sect. 8, we present some comparisons of the experimental results with and without splitting. The qualitative behavior of the results is largely the same throughout the experiments, regardless of the splitting setting.

We approximate the bias and variance of the predictive error of these estimators by numerical calculation as follows. For \(\textrm{ES}\in \{\textrm{FL},\textrm{RF},\textrm{LR}\}\), we calculate \(\hat{\mathscr {B}}_{\textrm{ES}}\) and \(\hat{\mathscr {V}}_{\textrm{ES}}\) over M independent realizations of the training data. For the i-th realization \(X_i\) and \({\varvec{\epsilon }}_i\) of the training data, we calculate the bias \(\hat{\mathscr {B}}_{\textrm{ES}}^{(i)}\) and variance \(\hat{\mathscr {V}}^{(i)}_{\textrm{ES}}~(i\in [M])\) as

$$\begin{aligned} \hat{\mathscr {B}}_{\textrm{ES}}^{(i)}= & {} \Vert {\varvec{a}}^*-P_{\textrm{ES}}X_i^\top (\lambda ^{(2)}I_n+X_iP_{\textrm{ES}}X_i^\top )^{-1}X_i{\varvec{a}}^*\Vert ^2_\Sigma ,\nonumber \\ \hat{\mathscr {V}}_{\textrm{ES}}^{(i)}= & {} \Vert P_{\textrm{ES}}X_i^\top (\lambda ^{(2)}I_n+X_iP_{\textrm{ES}}X_i^\top )^{-1}{\varvec{\epsilon }}_i\Vert ^2_\Sigma \nonumber \end{aligned}$$

where

$$\begin{aligned} P_{\textrm{ES}} = {\left\{ \begin{array}{ll} I_d~(\textrm{ES} = \textrm{LR}), \\ W_0W_0^\top ~(\textrm{ES} = \textrm{RF}),\\ W_1W_1^\top ~(\textrm{ES} = \textrm{FL}). \end{array}\right. }\nonumber \end{aligned}$$

Then, we calculate the averages \(\frac{1}{M}\sum _{i=1}^M\hat{\mathscr {B}}^{(i)}_{\textrm{ES}}\) and \(\frac{1}{M}\sum _{i=1}^M\hat{\mathscr {V}}^{(i)}_{\textrm{ES}}\) as estimates of the true expectations \(\mathbb {E}_X[\mathscr {B}_{\textrm{ES}}]\) and \(\mathbb {E}_X[\mathscr {V}_{\textrm{ES}}]\), respectively.
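The following is a minimal Monte Carlo sketch (ours) of this procedure on unsplit data, using NGD feature learning with \(\lambda ^{(1)}=\lambda ^{(2)}=0\) and a reduced number of repetitions; it only illustrates how \(\hat{\mathscr {B}}_{\textrm{ES}}^{(i)}\) and \(\hat{\mathscr {V}}_{\textrm{ES}}^{(i)}\) are averaged and is not the exact experiment code.

```python
import numpy as np

rng = np.random.default_rng(6)
n, d, N, M = 100, 200, 1000, 200                 # M is reduced here; the experiments use M = 1000
alpha, sigma_eps2, eta, lam2 = -0.5, 0.01, 1.0, 0.0
lam = np.arange(1, d + 1, dtype=float) ** (-(1 + alpha))
a_star = np.zeros(d); a_star[99] = lam[99] ** -0.5   # k* = 100

def bias_var(P, X, eps):
    """Single-realization bias and variance terms for a preconditioner P."""
    A = P @ X.T @ np.linalg.inv(lam2 * np.eye(n) + X @ P @ X.T)
    db = a_star - A @ (X @ a_star)
    dv = A @ eps
    return float(db @ (lam * db)), float(dv @ (lam * dv))

results = {"LR": [], "RF": [], "FL": []}
for _ in range(M):
    X = rng.standard_normal((n, d)) * np.sqrt(lam)
    eps = np.sqrt(sigma_eps2) * rng.standard_normal(n)
    y = X @ a_star + eps
    W0 = rng.standard_normal((d, N)) / np.sqrt(N)
    a0 = rng.standard_normal(N) / np.sqrt(N)
    # One-step NGD feature learning on the same (unsplit) data, lambda^(1) = 0.
    a_tilde = np.linalg.solve(np.diag(lam), X.T @ (y - X @ W0 @ a0 / np.sqrt(N))) / n
    W1 = W0 + eta * np.outer(a_tilde, a0)
    for name, P in (("LR", np.eye(d)), ("RF", W0 @ W0.T), ("FL", W1 @ W1.T)):
        results[name].append(bias_var(P, X, eps))

for name, vals in results.items():
    b, v = np.mean(vals, axis=0)
    print(f"{name}: bias ~ {b:.4f}, variance ~ {v:.4f}")
```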

In the following, unless otherwise mentioned, the parameters are set as follows: sample size \(n=100\), input dimension \(d=200\), number of neurons in the middle layer \(N=1000\), \(\alpha =-0.5,~k^* = 100,~\sigma _\epsilon ^2=0.01\), and \(\lambda ^{(1)}=\lambda ^{(2)}=0.\)

4.2 Experiments for small step size

In this subsection, we perform experiments for the small step size setting corresponding to Sect. 3.1. Although our analysis becomes more accurate as the step size decreases, we employ the rather large step size \(\eta =1\) because otherwise the improvement of the predictive error is hidden behind numerical errors. Here, we let \(M=1000\).

We observe the relationship between each parameter and the bias, where \({\varvec{a}}^*_{k^*}=1/\sqrt{\lambda _{k^*}},~{\varvec{a}}^*_i=0~(\forall i\ne k^*)\) for \(k^*\in [d]\). Here, we calculate \(\frac{\mathscr {B}_{\textrm{FL}}-\mathscr {B}_{\textrm{LR}}}{\mathscr {B}_{\textrm{LR}}}\) and \(\frac{\mathscr {B}_{\textrm{FL}}-\mathscr {B}_{\textrm{RF}}}{\mathscr {B}_{\textrm{RF}}}\) to compare the bias given by the GD and NGD feature learning methods with that given by the linear regression model or the random feature model. According to Corollaries 3 and 4, the change in the bias due to feature learning consists of a first term which makes the bias smaller and a second term which makes it larger.

In Fig. 1, we see that feature learning decreases the relative bias well when \(\tilde{{\varvec{a}}}\) is well aligned with \({\varvec{a}}^*\). In Figs. 2 and 3, we see that when d and \(\sigma _\epsilon ^2\) are increased, the variance \(\mathbb {E}_{X,{\varvec{\epsilon }}}[\Vert \tilde{{\varvec{a}}}-{\varvec{a}}^*\Vert ^2]\) becomes larger and the relative bias increases, especially for feature learning with NGD. In Fig. 4, we see that proper regularization can reduce the bias further.

Figure 1 shows the relationship between \(k^*\) and the bias for several values of \(\alpha \).

Fig. 1 The relationship between \(k^*\) and the bias

When \(k^*\) is small, the relative bias is increased by feature learning with NGD. In particular, as \(\alpha \) gets larger, the relative bias gets larger, except for \(k^*=1\) in (d). This is because, when \(k^*\) is small, the first term of Corollary 3, which reduces the bias, is small, while the second term, which increases it, is large. In contrast, feature learning with GD reduces the relative bias when \(k^*\) is small, and the reduction is especially large when \(\alpha \) is large. This is because, when \(k^*\) is small, \({\varvec{a^*}}\) is aligned with the leading directions of the input, and \(\tilde{{\varvec{a}}}\) in (4) with GD estimates the signal well, whereas \(\tilde{{\varvec{a}}}\) with NGD does not because of the large variance \(\mathbb {E}_{X,{\varvec{\epsilon }}}[\Vert \tilde{{\varvec{a}}}-{\varvec{a}}^*\Vert ^2]\). Here, the comparison with the linear regression model behaves differently from that with the random feature model when feature learning uses GD. This is simply because the difference between \(\hat{{\varvec{a}}}_{\textrm{RF}}\) and \(\hat{{\varvec{a}}}_{\textrm{LR}}\) is huge for finite N/d when \(\alpha \) is large or \(k^*\) is small (Fig. 10).

We next see that the relative bias for NGD feature learning decreases as \(k^*\) is increased, because the \(k^*\)-dependent decrease is \(\frac{1}{\lambda _{k^*}} = {k^*}^{(1+\alpha )}\). Moreover, the decrease with increasing \(k^*\) is faster when \(\alpha \) is larger. In general, regression is difficult when \(k^*\) is large because the principal components of \({\varvec{x}}\) differ from the direction of \({\varvec{a}}^*\). This means that one-step feature learning with NGD is beneficial for difficult instances. On the contrary, as \(k^*\) gets larger, the decrease in the relative bias due to feature learning with GD becomes small, because the \(k^*\)-dependent decrease is \({k^*}^{-(1+\alpha )}\). Furthermore, for larger \(\alpha \), the decrease in bias approaches zero more quickly with increasing \(k^*\). Hence, the benefit of one-step feature learning with GD is approximately zero for difficult instances, though it is large for easy instances.
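As a concrete illustration, assuming the polynomial eigenvalue decay \(\lambda _k = k^{-(1+\alpha )}\) implied by the identity \(\frac{1}{\lambda _{k^*}} = {k^*}^{(1+\alpha )}\) above, the two \(k^*\)-dependent terms behave as follows for \(\alpha =-0.5\):

$$\begin{aligned} \text {NGD:}~~\frac{1}{\lambda _{k^*}} = {k^*}^{1+\alpha } = \sqrt{k^*},\quad&\text {which is } 1 \text { at } k^*=1 \text { and } 10 \text { at } k^*=100,\\ \text {GD:}~~{k^*}^{-(1+\alpha )} = \frac{1}{\sqrt{k^*}},\quad&\text {which is } 1 \text { at } k^*=1 \text { and } 0.1 \text { at } k^*=100. \end{aligned}$$

That is, the bias reduction due to NGD grows with \(k^*\), while that due to GD vanishes.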

Figure 2 shows how the bias changes for different d with \(k^*=100\) fixed.

In both (a) and (b), the bias with NGD increases as d is increased. This is because the second term of Corollary 3 monotonically increases with respect to d; indeed, we see that \(\frac{d-k}{n}\frac{1}{\sum _{i>k}\lambda _i} = \mathcal {O}(\sqrt{d})\) for \(\alpha =-0.5\), which matches the empirical observation in the graph. In contrast, the bias with GD decreases slowly because of the second term of Corollary 4. In fact, the term \(\frac{\sum _{i>k}\lambda _i^2}{\lambda ^{(2)}+\sum _{i>k}\lambda _i}=\mathcal {O}(\log d/\sqrt{d})\) for \(\alpha =-0.5\), and it decreases slowly as d gets larger. The second term of Corollaries 3 and 4 intuitively means that the bias is increased when \(\mathbb {E}_{X,{\varvec{\epsilon }}}[\Vert \tilde{{\varvec{a}}}-{\varvec{a}}^*\Vert _\Sigma ^2]\) is large. This quantity grows as d grows when \(\tilde{{\varvec{a}}}\) is obtained by NGD feature learning, while it shrinks as the ratio \(k^*/d\) shrinks when \(\tilde{{\varvec{a}}}\) is obtained by GD feature learning.
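The two rates quoted above can be checked with a rough numerical sketch under the same assumed spectrum \(\lambda _i = i^{-(1+\alpha )}\); the values of n and k below are illustrative.

```python
import numpy as np

n, k, alpha, lam2 = 100, 100, -0.5, 0.0
for d in [200, 800, 3200, 12800]:
    lam = np.arange(1, d + 1) ** (-(1 + alpha))
    tail = lam[k:]                                     # lambda_i for i > k
    ngd_term = (d - k) / n / tail.sum()                # grows roughly like sqrt(d)
    gd_term = (tail ** 2).sum() / (lam2 + tail.sum())  # decays roughly like log(d)/sqrt(d)
    print(f"d={d:6d}  NGD term={ngd_term:8.3f}  GD term={gd_term:8.4f}")
```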

Figure 3 shows how the bias changes against \(\sigma _\epsilon ^2\). Here, we set \(M =10000\) in order to suppress fluctuations in the bias when \(\sigma _\epsilon ^2\) is increased.

The relative bias with NGD feature learning increases linearly with respect to \(\sigma ^2_\epsilon \). This is because \(\frac{d-k}{n}\frac{1}{\sum _{i>k}\lambda _i}\), which is the coefficient of \(\sigma ^2_\epsilon \), is positive. In contrast, the relative bias with GD feature learning is flat against increasing \(\sigma ^2_\epsilon \). This is because \(\frac{1}{n}\frac{\sum _{i>k}\lambda _i^2}{\lambda ^{(2)}+\sum _{i>k}\lambda _i}\), which is likewise the coefficient of \(\sigma ^2_\epsilon \), is almost zero.
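In other words, writing only the \(\sigma ^2_\epsilon \)-dependent part of the change in the bias (remaining factors suppressed for brevity, so this is schematic rather than the exact corollary statements):

$$\begin{aligned} \Delta \mathscr {B}_{\textrm{NGD}}&\approx \frac{d-k}{n}\frac{1}{\sum _{i>k}\lambda _i}\,\sigma ^2_\epsilon + (\text {terms independent of } \sigma ^2_\epsilon ),\\ \Delta \mathscr {B}_{\textrm{GD}}&\approx \frac{1}{n}\frac{\sum _{i>k}\lambda _i^2}{\lambda ^{(2)}+\sum _{i>k}\lambda _i}\,\sigma ^2_\epsilon + (\text {terms independent of } \sigma ^2_\epsilon ), \end{aligned}$$

where the first slope is positive and grows with d, while the second is almost zero.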

Figure 4a, b show how the bias changes against positive \(\lambda ^{(1)}\) and \(\lambda ^{(2)}\), respectively. Here, as \(\lambda ^{(1)}\) is increased, the step size becomes effectively smaller and the estimator of the feature learning model approaches that of the random feature model. Therefore, we set the step size to \(\eta (1+\lambda ^{(1)})\) to compensate for this.

Note that \(\lambda ^{(1)}=0\) corresponds to a situation where the feature learning is performed by NGD, and \(\lambda ^{(1)} = \infty \) corresponds to one where GD is used. In Fig. 4a, we set \(k^*=60\), for which the relative bias with NGD and that with GD are equal, and indeed the values of the bias at \(\lambda ^{(1)} = 0\) and \(\lambda ^{(1)} = 9\) are approximately equal. We can also see that feature learning decreases the relative bias the most at \(\lambda ^{(1)}=0.2\). This indicates that one-step feature learning in an intermediate gradient direction between NGD and GD reduces the bias more than either NGD or GD alone. This can be explained as follows. When feature learning is performed with NGD, \(\mathbb {E}_{X,{\varvec{\epsilon }}}[\tilde{{\varvec{a}}}]={\varvec{a}}^*\) and the first term of Corollary 3, which decreases the bias, is large, while \(\mathbb {E}_{X,{\varvec{\epsilon }}}[\Vert \tilde{{\varvec{a}}}-{\varvec{a}}^*\Vert ^2]\) is large and the second term of Corollary 3, which increases the bias, is also large. In contrast, when feature learning is performed with GD, \(\Vert \mathbb {E}_{X,{\varvec{\epsilon }}}[\tilde{{\varvec{a}}}]-{\varvec{a}}^*\Vert ^2\) is large and the first term of Corollary 3, which decreases the bias, is small, while \(\mathbb {E}_{X,{\varvec{\epsilon }}}[\Vert \tilde{{\varvec{a}}}-{\varvec{a}}^*\Vert ^2]\) is small and the second term of Corollary 3, which increases the bias, is also small. Therefore, by choosing a value of \(\lambda ^{(1)}\) appropriately, feature learning between NGD and GD balances \(\Vert \mathbb {E}_{X,{\varvec{\epsilon }}}[\tilde{{\varvec{a}}}]-{\varvec{a}}^*\Vert ^2\) against \(\mathbb {E}_{X,{\varvec{\epsilon }}}[\Vert \tilde{{\varvec{a}}}-{\varvec{a}}^*\Vert ^2]\) and thus reduces the bias further.
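A minimal sketch of such an interpolating one-step update is given below. The preconditioner form \(P(\lambda ^{(1)}) = (\lambda ^{(1)} I_d + X^\top X/n)^{-1}\) is our assumption (it reproduces the stated limits: an NGD-like step at \(\lambda ^{(1)}=0\) and, with the compensated step size \(\eta (1+\lambda ^{(1)})\), plain GD as \(\lambda ^{(1)}\rightarrow \infty \)); the parameterization \(f({\varvec{x}})={\varvec{x}}^\top W{\varvec{a}}\) and all names are illustrative rather than the paper's exact definitions.

```python
import numpy as np

def one_step_first_layer(W0, a0, X, y, eta, lam1):
    """One preconditioned gradient step on the first layer of a two-layer
    linear net f(x) = x^T W a, taken at the initial second-layer weights a0.
    Hypothetical preconditioner: lam1 = 0 gives an NGD-like step, while
    lam1 -> infinity (with step eta * (1 + lam1)) recovers plain GD."""
    n, d = X.shape
    r = y - X @ W0 @ a0                      # residuals on the training data
    g = np.outer(X.T @ r, a0) / n            # rank-one descent direction in W
    P = np.linalg.pinv(lam1 * np.eye(d) + X.T @ X / n)   # pinv handles lam1 = 0 when n < d
    return W0 + eta * (1 + lam1) * (P @ g)
```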

Next, as for (b), the relative bias with NGD decreases with increasing \(\lambda ^{(2)}\). This is because \(\frac{1}{\lambda ^{(2)}+\sum _{i>k}\lambda _i}\), which appears in the second term of Corollary 3, decreases with increasing \(\lambda ^{(2)}\). On the other hand, the green plot, which is the relative bias with GD, hardly decreases. This is because \(\frac{\sum _{i>k}\lambda _i^2}{\lambda ^{(2)}+\sum _{i>k}\lambda _i}\), which increases the bias, is already small even when \(\lambda ^{(2)}=0\). Here, the reason why the blue plot decreases is simply that the bias of the linear regression model is larger than that of the random feature model when \(\lambda ^{(2)}\) is small.

We now discuss the variance terms in Corollaries 3 and 4. In Fig. 5, we plot \(\frac{\mathscr {V}_{\textrm{FL}}-\mathscr {V}_{\textrm{LR}}}{\mathscr {V}_{\textrm{LR}}}\) and \(\frac{\mathscr {V}_{\textrm{FL}}-\mathscr {V}_{\textrm{RF}}}{\mathscr {V}_{\textrm{RF}}}\) for several \(\alpha \).

(a) shows that the variance of estimators with feature learning is almost the same as that with random features (i.e., no feature learning). Although the variance of the feature learning models appears worse than that of linear regression in (b), this is simply because the variance of the random feature model is larger than that of the linear regression model. The fact that feature learning leaves the variance almost unchanged can be explained as follows. The variance of linear regression is mostly determined by the decay rate of the eigenvalues of the covariance matrix \(\mathbb {E}[{\varvec{x}}{\varvec{x}}^\top ]\). In feature learning models, the eigenvalue decay rate of the covariance matrix of the data passed through the first layer, \(\mathbb {E}[W_1^\top {\varvec{x}}{\varvec{x}}^\top W_1]\), is almost the same as that of \(\mathbb {E}[{\varvec{x}}{\varvec{x}}^\top ]\), because \(W_1\) is obtained by adding just a rank-1 matrix to \(W_0\), which approximately satisfies \(W_0W_0^\top = I\). Therefore, the variance is not changed much by one-step feature learning.
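This spectral argument can be illustrated with a small numerical sketch. It assumes an i.i.d. Gaussian \(W_0\) with variance 1/N (so that \(W_0W_0^\top \approx I_d\)), the spectrum \(\lambda _i = i^{-(1+\alpha )}\), and an arbitrary small rank-one perturbation in place of the actual gradient step; the nonzero spectra are compared via \(\Sigma ^{1/2}WW^\top \Sigma ^{1/2}\).

```python
import numpy as np

rng = np.random.default_rng(0)
d, N, alpha = 200, 1000, -0.5
W0 = rng.standard_normal((d, N)) / np.sqrt(N)                    # W0 W0^T ~ I_d
W1 = W0 + 0.1 * np.outer(rng.standard_normal(d) / np.sqrt(d),
                         rng.standard_normal(N) / np.sqrt(N))    # rank-one update

lam = np.arange(1, d + 1) ** (-(1 + alpha))
S_half = np.diag(np.sqrt(lam))
ev0 = np.sort(np.linalg.eigvalsh(S_half @ W0 @ W0.T @ S_half))[::-1]
ev1 = np.sort(np.linalg.eigvalsh(S_half @ W1 @ W1.T @ S_half))[::-1]
print(ev0[[0, 9, 49, 199]])   # eigenvalue decay profile before the update
print(ev1[[0, 9, 49, 199]])   # ...and after: essentially unchanged
```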

4.3 Experiments for large step size

In this subsection, we perform experiments for the large step size setting, which corresponds to our theoretical analysis in Sect. 3.2. For the update of the first layer, we write the feature learning estimators as \(\hat{{\varvec{a}}}_{\mathrm {FL_{GD}}}\), \(\hat{{\varvec{a}}}_{\mathrm {FL_{NGD}}}\) and \(\hat{{\varvec{a}}}_{\mathrm {FL_{WLR}}}\), corresponding to \(W_1 = W_0 + \eta P_1(\lambda ^{(1)}) \nabla _{W_0}\mathcal {L}\), \(W_1 = W_0 + \eta P_2(\lambda ^{(1)}) \nabla _{W_0}\mathcal {L}\) and \(W_1 = W_0 + \eta P_3(\lambda ^{(1)}) \nabla _{W_0}\mathcal {L}\), respectively.

Figure 6 depicts the bias \(\mathscr {B}_{\textrm{ES}}\) corresponding to each \(\hat{{\varvec{a}}}_\textrm{ES}\) for \(\textrm{ES}\in \{\mathrm {FL_{GD}},~\mathrm {FL_{NGD}}, ~\mathrm {FL_{WLR}},~\textrm{RF},~\textrm{LR}\}\) against the step size \(\eta \) where \({\varvec{a}}^*_{k^*}=1/\sqrt{\lambda _{k^*}},~{\varvec{a}}^*_i=0~(\forall i\ne k^*)\).

Fig. 2 The relationship between d and the bias

Fig. 3 The relationship between \(\sigma _\epsilon ^2\) and the bias

Fig. 4 The relationship between \(\lambda ^{(1)},~\lambda ^{(2)}\) and the bias

Fig. 5 The relationship between \(\alpha \) and the variance

Fig. 6 The change in each bias when \(\eta \) is varied

We observe that \(\mathscr {B}_{\mathrm {FL_{WLR}}}\) and \(\mathscr {B}_{\mathrm {FL_{GD}}}\) decrease as \(\eta \) gets larger, while \(\mathscr {B}_{\mathrm {FL_{NGD}}}\) hardly changes. This can be explained as follows. When \({\varvec{a}}^*\) is sparse and \(\Vert {\varvec{b}}\Vert ^2_\Sigma \) is tiny, Corollary 7 shows that the bias of the upper k components is decreased to below the estimation error in the first layer, while that of the remaining components stays the same. Therefore, if the weights of the difference \(\tilde{{\varvec{a}}}-{\varvec{a}}^*\) are concentrated in the upper components, the bias can be further reduced by increasing \(\eta \). In contrast, if the weights are spread over the components after k, increasing \(\eta \) does not reduce the bias much. From the proof of Corollary 3, the error between \(\tilde{{\varvec{a}}}\) updated by NGD and \({\varvec{a}}^*\) is isotropically distributed over the whole space. Hence, since \(k=o(d)\), the weights of \(\tilde{{\varvec{a}}}-{\varvec{a}}^*\) are mostly distributed over the components after k, and \(\mathscr {B}_{\mathrm {FL_{NGD}}}\) does not decrease even if \(\eta \) is increased. On the other hand, the weights of the error between \({\varvec{a}}^*\) and \(\tilde{{\varvec{a}}}\) updated by GD or WLR are distributed over both the upper k components and the components after k. Thus, \(\mathscr {B}_{\mathrm {FL_{WLR}}}\) and \(\mathscr {B}_{\mathrm {FL_{GD}}}\) decrease with increasing \(\eta \).

Next, we discuss the effect of the step size \(\eta \) on the variance. Figure 7 shows the variance of each estimator as a function of the step size \(\eta \).

Fig. 7 The variance of each estimator against the step size \(\eta \) (\(\alpha =-0.5\))

From Fig. 7, we can see that the variance of every feature learning method is almost the same as that of the random feature model. This is in accordance with the note of Theorem 4 that the eigenvalues of \(\Sigma ^\frac{1}{2}W_1W_1^\top \Sigma ^\frac{1}{2}\) are almost the same as those of \(\Sigma ^\frac{1}{2}W_0W_0^\top \Sigma ^\frac{1}{2}\), even though \(W_0\) is updated to \(W_1\) by adding a rank-1 matrix.

Finally, we show the predictive error of each estimator as a function of the step size \(\eta \) in Fig. 8. Note that the predictive error is the sum of the bias and variance, \(\mathscr {B}_{\textrm{ES}}+\mathscr {V}_{\textrm{ES}}\), which can be obtained as the sum of the graphs in Figs. 6 and 7.

Fig. 8 The change in each overall predictive error when \(\eta \) is varied

Figure 8 shows that the estimator of the feature learning model outperforms that of the one-layer linear regression model not only in terms of bias but also in terms of overall predictive error, especially with NGD feature learning. In this experiment, the variance is about one order of magnitude smaller than the bias, and thus the behavior of the predictive error in Fig. 8 is almost entirely dominated by the bias. However, even if we multiplied \(\sigma ^2_\epsilon \) by 10 so that the bias and variance had the same magnitude, the behavior of the predictive error would still approximately follow the bias. This is because, from Figs. 6 and 7, the change in the bias induced by feature learning is about two orders of magnitude larger than the change in the variance.

5 Conclusion and future work

Summary We analyzed the effect of one-step feature learning on the predictive error of a two-layer linear neural network in a high-dimensional regression setting, and gave detailed evaluations of the bias and variance in two step-size regimes (small and large). In particular, for the small step size, we gave a concrete comparison between the predictive error of the estimator obtained with feature learning and that of one-layer linear regression, and confirmed the correspondence with numerical experiments. Theoretical and experimental results showed that, in both regimes, feature learning is more effective when the eigenvalue decay is slow (\(\alpha <0\)).

Future Work In this study, we analyzed a one-step feature learning model in the regime where the number of neurons is sufficiently large; however, an analysis for finite-width networks is important because neural networks with a finite number of neurons are used in practice. In addition, this study analyzed a linear neural network without a nonlinear activation function in the intermediate layer, so the resulting estimator is seriously limited in that it can only represent linear functions of the input. Therefore, analyzing feature learning models with a nonlinear activation function is an extremely important direction for future work.