Abstract
The aim of this paper is twofold. First, we show that a certain concatenation of a proximity operator with an affine operator is again a proximity operator on a suitable Hilbert space. Second, we use our findings to establish so-called proximal neural networks (PNNs) and stable tight frame proximal neural networks. Let \(\mathcal {H}\) and \(\mathcal {K}\) be real Hilbert spaces, \(b \in \mathcal {K}\) and \(T \in \mathcal {B} (\mathcal {H},\mathcal {K})\) a linear operator with closed range and Moore–Penrose inverse \(T^\dagger \). Based on the well-known characterization of proximity operators by Moreau, we prove that for any proximity operator \(\mathrm {Prox}:\mathcal {K}\rightarrow \mathcal {K}\) the operator \(T^\dagger \, \mathrm {Prox}( T \cdot + b)\) is a proximity operator on \(\mathcal {H}\) equipped with a suitable norm. In particular, it follows for the frequently applied soft shrinkage operator \(\mathrm {Prox}= S_{\lambda }:\ell _2 \rightarrow \ell _2\) and any frame analysis operator \(T:\mathcal {H}\rightarrow \ell _2\) that the frame shrinkage operator \(T^\dagger \, S_\lambda \, T\) is a proximity operator on a suitable Hilbert space. The concatenation of proximity operators on \(\mathbb R^d\) equipped with different norms establishes a PNN. If the network arises from tight frame analysis or synthesis operators, then it forms an averaged operator. In particular, it has Lipschitz constant 1 and belongs to the class of so-called Lipschitz networks, which were recently applied to defend against adversarial attacks. Moreover, due to their averaging property, PNNs can be used within so-called Plug-and-Play algorithms with convergence guarantee. In the case of Parseval frames, we call the networks Parseval proximal neural networks (PPNNs). Then, the involved linear operators lie in a Stiefel manifold and corresponding minimization methods can be applied for the training of such networks.
Finally, some proof-of-concept examples demonstrate the performance of PPNNs.
1 Introduction
Wavelet and frame shrinkage operators have become very popular in recent years. A starting point was the iterative shrinkage-thresholding algorithm (ISTA) in [16], which was interpreted as a special case of the forward-backward algorithm in [14]. For relations to other algorithms, see also [8, 40]. Let \(T \in \mathbb R^{n \times d}\), \(n\ge d\), have full column rank. Then, the problem
is known as the analysis point of view. For orthogonal \(T \in {\mathbb R}^{d \times d}\), the solution of (1) is given by the frame soft shrinkage operator \(T^\dagger \, S_\lambda \, T = T^* \, S_\lambda \, T\), see Example 2.3. If \(T \in \mathbb R^{n \times d}\) with \(n \le d\) and \(T T^* = I_n\), the solution of problem (1) is given by \(I_d - T^* T + T^* S_\lambda T\), see [6, Theorem 6.15]. For arbitrary \(T \in {\mathbb R}^{n \times d}\), \(n \ge d\), there are no analytic expressions for the solution of (1) in the literature.
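For the orthogonal case, the closed-form solution can be checked numerically. The following sketch assumes that problem (1) is the analysis problem \(\min _x \frac{1}{2}\Vert x - y\Vert ^2 + \lambda \Vert Tx\Vert _1\) (the display equation is not reproduced above); the random-perturbation test is only a sanity check, not a proof of optimality.

```python
import numpy as np

rng = np.random.default_rng(0)

def soft_shrink(x, lam):
    # componentwise soft shrinkage S_lambda
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

d, lam = 6, 0.3
# random orthogonal T via QR decomposition
T, _ = np.linalg.qr(rng.standard_normal((d, d)))
y = rng.standard_normal(d)

def F(x):
    # assumed form of the analysis problem (1)
    return 0.5 * np.sum((x - y) ** 2) + lam * np.sum(np.abs(T @ x))

# frame soft shrinkage T* S_lambda T applied to y
x_star = T.T @ soft_shrink(T @ y, lam)

# the closed form should beat random nearby perturbations
for _ in range(1000):
    x_pert = x_star + 1e-2 * rng.standard_normal(d)
    assert F(x_star) <= F(x_pert) + 1e-12
```

Substituting \(z = Tx\) turns the objective into \(\frac{1}{2}\Vert z - Ty\Vert ^2 + \lambda \Vert z\Vert _1\), whose minimizer is \(S_\lambda (Ty)\); this is why the check succeeds for orthogonal \(T\).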
The question whether the frame shrinkage operator can itself be seen as a proximity operator has recently been studied in [20]. The authors showed that the set-valued operator \((T^\dagger S_\lambda T)^{-1} - I_d\) is maximally cyclically monotone, which implies that it is a proximity operator with respect to some norm in \(\mathbb R^d\). In this paper, we prove that for any operator \(T\in \mathcal {B} (\mathcal {H},\mathcal {K})\) with closed range, \(b \in \mathcal {K}\) and any proximity operator \(\mathrm {Prox}:\mathcal {K}\rightarrow \mathcal {K}\), the new operator \(T^\dagger \, \mathrm {Prox}\, ( T \cdot + b) :\mathcal {H}\rightarrow \mathcal {H}\) is also a proximity operator on the linear space \(\mathcal {H}\), but equipped with another inner product. The above-mentioned finite-dimensional setting is included as a special case. In contrast to [20], we approach the problem directly using a classical result of Moreau [34]. Moreover, we provide the function defining the proximity operator. We would like to mention that this function can also be deduced from Proposition 3.9 in [12]. However, since this deduction would require more space than the direct proof of our Theorem 3.4, we prefer the direct approach. Note that different norms in the definition of the proximity operator were successfully used in variable metric algorithms, see [10].
Recently, it was shown that many activation functions appearing in neural networks are indeed proximity operators [13]. Based on these observations and our previous findings, we consider neural networks that are concatenations of proximity operators and call them proximal neural networks (PNNs). PNNs can be considered within the framework of the variational networks proposed in [29]. For stability reasons, PNNs related to linear operators from Stiefel manifolds are of special interest. They form so-called averaged operators and are consequently nonexpansive. Orthogonal matrices have already shown advantages in training recurrent neural networks (RNNs) [3, 4, 28, 31, 47, 49]. Using orthogonal matrices, vanishing or exploding gradients in training RNNs can be avoided [17]. The more general setting of learning rectangular matrices from a Stiefel manifold was proposed, e.g., in [25], but with a different focus than in this paper. The most relevant paper with respect to our setting is [26], where the authors considered the so-called optimization over multiple dependent Stiefel manifolds (OMDSM). We will see that the NNs in [26] are special cases of our PNNs, so that our analysis ensures that they are averaged operators.
Using matrices from Stiefel manifolds results in 1-Lipschitz neural networks. Consequently, our approach is naturally related to other methods for controlling the Lipschitz constant of neural networks, which provably increases robustness against adversarial attacks [46]. In [23], the constant is controlled by projecting back all weight matrices in the network that violate a pre-defined threshold on the \(\Vert \cdot \Vert _p\) norm, \(p \in [1,\infty ]\), of the weight matrices. The authors in [39] characterize the singular values of the linear map associated with convolutional layers and use this for projecting a convolutional layer onto an operator-norm ball. Another closely related approach is spectral normalization as proposed in [33], where the spectral norm of every weight matrix is enforced to be one. Compared to our approach, this only restricts the largest singular value of the linear operators arising in the neural network. Limitations of the expressiveness of networks with restricted Lipschitz constants in every layer were discussed in [2, 27]. Note that our approach does not restrict the Lipschitz constants in every individual layer. Further, none of the above approaches is able to impose more structure on the network such as being an averaged operator.
Our results may be of interest in so-called Plug-and-Play algorithms [9, 42, 45]. In these algorithms a well-behaved operator, e.g., a proximity operator, is replaced by an efficient denoiser such as a neural network. However, training a denoising framework without structure can lead to a divergent algorithm, see [41]. In contrast, it was shown in [44] that a particular version of a Plug-and-Play algorithm converges if the network is averaged.
Our paper is organized as follows: We begin with preliminaries on convex analysis in Hilbert spaces in Sect. 2. In Sect. 3, we prove our general results on the interplay between proximity operators and certain affine operators. As a special case, we emphasize in Sect. 4 that the frame soft shrinkage operator is itself a proximity operator. In Sect. 5, we use our findings to set up neural networks as concatenations of proximity operators on \(\mathbb R^d\) equipped with different norms related to linear operators. If these operators are related to tight frames, our proposed network is actually an averaged operator. In the case of Parseval frames, the involved matrices lie in Stiefel manifolds and we end up with PPNNs. Sect. 6 deals with the training of PPNNs via stochastic gradient descent on Stiefel manifolds. In Sect. 7, we provide first numerical examples. Finally, Sect. 8 contains conclusions and addresses further research questions.
2 Preliminaries
Let \(\mathcal {H}\) be a real Hilbert space with inner product \(\langle \cdot ,\cdot \rangle \) and norm \(\Vert \cdot \Vert \). By \(\Gamma _0(\mathcal {H})\) we denote the set of proper, convex, lower semi-continuous functions on \(\mathcal {H}\) mapping into \((-\infty , \infty ]\). For \(f \in \Gamma _0(\mathcal {H})\) and \(\lambda > 0\), the proximity operator \(\mathrm {prox}_{\lambda f}:\mathcal {H}\rightarrow \mathcal {H}\) and its Moreau envelope \(M_{\lambda f}:\mathcal {H}\rightarrow \mathbb {R}\) are defined by
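For the reader's convenience, we recall the standard definitions behind the omitted display (cf. [5]); the paper's formulas are assumed to agree with these up to notation:

```latex
\mathrm{prox}_{\lambda f}(x)
  := \operatorname*{argmin}_{y \in \mathcal{H}}
     \Big\{ \tfrac{1}{2}\|x - y\|^2 + \lambda f(y) \Big\},
\qquad
M_{\lambda f}(x)
  := \min_{y \in \mathcal{H}}
     \Big\{ \tfrac{1}{2}\|x - y\|^2 + \lambda f(y) \Big\}.
```

The argmin is a singleton since the objective is strongly convex, so \(\mathrm {prox}_{\lambda f}\) is single-valued.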
Clearly, the proximity operator and its Moreau envelope depend on the underlying space \(\mathcal {H}\), in particular on the chosen inner product. Recall that an operator \(A:\mathcal {H}\rightarrow \mathcal {H}\) is called firmly nonexpansive if for all \(x,y \in \mathcal {H}\) the following relation is fulfilled
Obviously, firmly nonexpansive operators are nonexpansive.
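For reference, the omitted defining inequality, which the text later cites as (2), is the standard one: for all \(x, y \in \mathcal {H}\),

```latex
\| A x - A y \|^2 \;\le\; \langle x - y,\, A x - A y \rangle .
```

Nonexpansiveness then follows directly from the Cauchy–Schwarz inequality.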
For a Fréchet differentiable function \(\Phi :\mathcal {H}\rightarrow \mathbb {R}\), the gradient \(\nabla \Phi (x)\) at \(x \in \mathcal {H}\) is defined as the vector satisfying for all \(h \in \mathcal {H}\),
where \(D\Phi :\mathcal {H}\rightarrow \mathcal {B} (\mathcal {H},\mathbb {R})\) denotes the Fréchet derivative of \(\Phi \), i.e., for all \(x,h \in \mathcal {H}\),
Note that the gradient crucially depends on the chosen inner product in \(\mathcal {H}\). The following results can be found, e.g., in [5, Propositions 12.27, 12.29].
Theorem 2.1
Let \(f \in \Gamma _0(\mathcal {H})\). Then, the following relations hold true:
(i) The operator \(\mathrm {prox}_{\lambda f} :\mathcal {H}\rightarrow \mathcal {H}\) is firmly nonexpansive.
(ii) The function \(M_{\lambda f}\) is (Fréchet) differentiable with Lipschitz-continuous gradient given by
Clearly, (ii) implies that
where \(\Phi := \frac{1}{2} \Vert \cdot \Vert ^2 - M_{\lambda f}\) is convex as \(\mathrm {prox}_{\lambda f}\) is nonexpansive [5, Proposition 17.10]. Further, Moreau showed that the following (converse) statement also holds true [34, Corollary 10c].
Theorem 2.2
The operator \(\mathrm {Prox}:\mathcal {H}\rightarrow \mathcal {H}\) is a proximity operator if and only if it is nonexpansive and there exists a function \(\Psi \in \Gamma _0(\mathcal {H})\) with \(\mathrm {Prox}(x) \in \partial \Psi (x)\) for any \(x \in \mathcal {H}\), where \(\partial \Psi \) denotes the subdifferential of \(\Psi \).
Thanks to (3), we conclude that \(\mathrm {Prox}:\mathcal {H}\rightarrow \mathcal {H}\) is a proximity operator if and only if it is nonexpansive and the gradient of a convex, differentiable function \(\Phi :\mathcal {H}\rightarrow \mathbb {R}\). Recently, the characterization of Bregman proximity operators in a more general setting was discussed in [24]. In the following example, we recall the Moreau envelope and the proximity operator related to the soft thresholding operator.
Example 2.3
Let \(\mathcal {H}= \mathbb {R}\) with usual norm \(|\cdot |\) and \(f(x) := |x|\). Then, \(\mathrm {prox}_{\lambda f}\) is the soft shrinkage operator \(S_\lambda \) defined by
and the Moreau envelope is the Huber function
Hence, \(\mathrm {prox}_{\lambda f} = \nabla \varphi \), where \(\varphi (x) = \frac{x^{2}}{2} - m_{\lambda | \cdot |}(x)\), i.e.,
For \(\mathcal {H}= \mathbb R^d\) and \(f(x) := \Vert x\Vert _1\), we can use a componentwise approach. Then, \(S_\lambda \) is defined componentwise, the Moreau envelope reads as \(M_{\lambda \Vert \cdot \Vert _1} (x) = \sum _{i=1}^d m_{\lambda | \cdot |}(x_i)\) and the potential of \(\mathrm {prox}_{\lambda \Vert \cdot \Vert _1}\) is \(\Phi (x) = \sum _{i=1}^d \varphi (x_i)\).
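A short numerical illustration of Example 2.3: the sketch below evaluates the soft shrinkage operator and checks on a fine grid that the Huber function is indeed the Moreau envelope of \(\lambda |\cdot |\) and that the grid minimizer matches \(S_\lambda \). The grid search is only a sanity check.

```python
import numpy as np

def soft_shrink(x, lam):
    # S_lambda, the proximity operator of lam*|.|
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def huber(x, lam):
    # Moreau envelope m_{lam|.|} (the Huber function)
    return np.where(np.abs(x) <= lam,
                    0.5 * x ** 2,
                    lam * np.abs(x) - 0.5 * lam ** 2)

lam = 0.5
assert soft_shrink(2.0, lam) == 1.5 and soft_shrink(-0.3, lam) == 0.0

# check m_{lam|.|}(x) = min_y { (x-y)^2/2 + lam|y| } on a fine grid
ys = np.linspace(-5.0, 5.0, 200001)
for x in (-2.0, -0.4, 0.0, 0.7, 3.0):
    objective = 0.5 * (x - ys) ** 2 + lam * np.abs(ys)
    assert abs(np.min(objective) - huber(x, lam)) < 1e-6
    # the grid minimizer should match the soft shrinkage value
    assert abs(ys[np.argmin(objective)] - soft_shrink(x, lam)) < 1e-4
```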
3 Interplay Between Proximity and Linear Operators
Let \(\mathcal {H}\) and \(\mathcal {K}\) be real Hilbert spaces with inner products \(\langle \cdot ,\cdot \rangle _\mathcal {H}\) and \(\langle \cdot ,\cdot \rangle _\mathcal {K}\) and corresponding norms \(\Vert \cdot \Vert _\mathcal {H}\) and \(\Vert \cdot \Vert _\mathcal {K}\), respectively. By \(\mathcal {B}(\mathcal {H},\mathcal {K})\) we denote the space of bounded, linear operators from \(\mathcal {H}\) to \(\mathcal {K}\). The kernel and the range of \(T\in \mathcal {B}(\mathcal {H},\mathcal {K})\) are denoted by \(\mathcal N(T)\) and \(\mathcal R(T)\), respectively. In this section, we show that for any nontrivial operator \(T \in \mathcal {B}(\mathcal {H},\mathcal {K})\) with closed range \(\mathcal {R}(T)\), \(b \in \mathcal {K}\) and proximity operator \(\mathrm {Prox}:\mathcal {K}\rightarrow \mathcal {K}\), the operator \(T^\dagger \, \mathrm {Prox}( T \cdot + b):\mathcal {H}\rightarrow \mathcal {H}\) is itself a proximity operator on the linear space \(\mathcal {H}\) equipped with a suitable (equivalent) norm \(\Vert \cdot \Vert _{{\mathcal {H}_{T}}}\), i.e., there exists a function \(f \in \Gamma _0(\mathcal {H})\) such that
Throughout this section, let \(T \in \mathcal {B}(\mathcal {H},\mathcal {K})\) have closed range. Then, the same holds true for its adjoint \(T^*:\mathcal {K}\rightarrow \mathcal {H}\) and the following (orthogonal) decompositions hold
The Moore–Penrose inverse (generalized inverse, pseudo-inverse) \(T^\dagger \in \mathcal {B}(\mathcal {K},\mathcal {H})\) is given point-wise by
see [5]. Further, it satisfies \(\mathcal {R} (T^\dagger ) = \mathcal {R} (T^*)\) and
where \(P_{C}\) is the orthogonal projection onto the closed, convex set C, see [5, Proposition 3.28]. Then, it follows
If T is injective, then \(T^\dagger = (T^*\,T)^{-1} T^*\) and if T is surjective, we have \(T^\dagger = T^*(T\,T^*)^{-1}\).
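These two closed-form expressions for \(T^\dagger \) are easy to validate numerically against a general-purpose pseudo-inverse routine:

```python
import numpy as np

rng = np.random.default_rng(1)

# injective case: T in R^{5,3} with full column rank (a.s. for Gaussian entries)
T = rng.standard_normal((5, 3))
T_dag = np.linalg.inv(T.T @ T) @ T.T          # (T* T)^{-1} T*
assert np.allclose(T_dag, np.linalg.pinv(T))
# T^dagger T = Id on H for injective T
assert np.allclose(T_dag @ T, np.eye(3))

# surjective case: S in R^{3,5} with full row rank
S = rng.standard_normal((3, 5))
S_dag = S.T @ np.linalg.inv(S @ S.T)          # T* (T T*)^{-1}
assert np.allclose(S_dag, np.linalg.pinv(S))
# T T^dagger = Id on K for surjective T
assert np.allclose(S @ S_dag, np.eye(3))
```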
Every \(T\in \mathcal {B}(\mathcal {H},\mathcal {K})\) gives rise to an inner product in \(\mathcal {H}\) via
with corresponding norm
If T is injective, the second summand vanishes. In general, this norm only induces a pre-Hilbert structure. Since \(T \in \mathcal {B}(\mathcal {H},\mathcal {K})\) has closed range, the norms \(\Vert \cdot \Vert _\mathcal {H}\) and \(\Vert \cdot \Vert _{\mathcal {H}_{T}}\) are equivalent on \(\mathcal {H}\) due to
and
for all \(x \in \mathcal {H}\). The norm equivalence also ensures the completeness of \(\mathcal {H}\) equipped with the new norm. To emphasize that we consider the linear space \(\mathcal {H}\) with this norm, we write \({\mathcal {H}_{T}}\). For special \(T \in \mathcal {B}(\mathcal {H},\mathcal {K})\), the inner product (7) coincides with the one in \(\mathcal {H}\).
Lemma 3.1
Let \(T \in \mathcal {B}(\mathcal {H},\mathcal {K})\) fulfill \(T^* T = \Vert T\Vert ^2_{\mathcal {B}(\mathcal {H},\mathcal {K})} \mathrm {Id}_\mathcal {H}\) or \(T T^* = \Vert T\Vert ^2_{\mathcal {B}(\mathcal {H},\mathcal {K})} \mathrm {Id}_\mathcal {K}\), where \(\mathrm {Id}_\mathcal {H}\) and \(\mathrm {Id}_\mathcal {K}\) denote the identity operator on \(\mathcal {H}\) and \(\mathcal {K}\), respectively. Then, the inner product (7) coincides with the one in \(\mathcal {H}\) and consequently \(\mathcal {H}= \mathcal {H}_T\).
Proof
If \(T^* T = \Vert T\Vert ^2_{\mathcal {B}(\mathcal {H},\mathcal {K})}\mathrm {Id}_\mathcal {H}\), then T is injective such that (7) implies
If \(T T^* = \Vert T\Vert ^2_{\mathcal {B}(\mathcal {H},\mathcal {K})}\mathrm {Id}_\mathcal {K}\), then (6) implies that \(P_{\mathcal {N}(T)} = \mathrm {Id}_\mathcal {H}- T^\dagger T = \mathrm {Id}_\mathcal {H}- T^* T/\Vert T\Vert ^2_{\mathcal {B}(\mathcal {H},\mathcal {K})}\) and
\(\square \)
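As a sanity check of Lemma 3.1, we assume that the inner product (7) has the form \(\langle x,y\rangle _{\mathcal {H}_T} = \langle Tx,Ty\rangle _\mathcal {K}/\Vert T\Vert ^2 + \langle P_{\mathcal {N}(T)}x, P_{\mathcal {N}(T)}y\rangle _\mathcal {H}\) (the display is not reproduced above). Under this assumption, the following sketch verifies the lemma for a matrix with \(TT^* = c \, \mathrm {Id}\):

```python
import numpy as np

rng = np.random.default_rng(2)

# T in R^{2,4} with T T* = 4 I_2: scaled orthonormal rows
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))
T = 2.0 * Q[:, :2].T
c = np.linalg.norm(T, 2) ** 2                 # spectral norm squared = 4
assert np.allclose(T @ T.T, c * np.eye(2))

P_N = np.eye(4) - np.linalg.pinv(T) @ T       # projection onto N(T), cf. (6)
# Gram matrix of the assumed inner product (7)
G = (T.T @ T) / c + P_N.T @ P_N
# Lemma 3.1: the inner product coincides with the Euclidean one
assert np.allclose(G, np.eye(4))
```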
To apply the characterization of proximal mappings in \(\mathcal {H}_T\) by Moreau, see Theorem 2.2, we have to compute gradients in \(\mathcal {H}_T\). Here, the following result is crucial.
Lemma 3.2
Let \(\mathcal {H}\) and \(\mathcal {K}\) be real Hilbert spaces with inner products \(\langle \cdot ,\cdot \rangle _\mathcal {H}\) and \(\langle \cdot ,\cdot \rangle _\mathcal {K}\), respectively. For an operator \(T \in \mathcal {B}(\mathcal {H},\mathcal {K})\) with closed range, let \(\mathcal {H}_T\) be the Hilbert space with inner product (7). For (Fréchet) differentiable \(\Phi :\mathcal {H}\rightarrow \mathbb R\), the gradients \(\nabla _\mathcal {H}\Phi \) and \(\nabla _{\mathcal {H}_T} \Phi \) with respect to the different inner products are related by
Proof
The gradient \(\nabla _{\mathcal {H}_{T}}\Phi (x)\) at \(x \in \mathcal {H}\) in the space \({\mathcal {H}_{T}}\) is given by the vector satisfying
for all \(h \in \mathcal {H}\). Since
the gradient depends on the chosen inner product through
\(\square \)
Now, the desired result follows from the next theorem.
Theorem 3.3
Let \(b \in \mathcal {K}\), \(T \in \mathcal {B}(\mathcal {H},\mathcal {K})\) have closed range and \(\mathrm {Prox}:\mathcal {K}\rightarrow \mathcal {K}\) be a proximity operator on \(\mathcal {K}\). Then, the operator \(A := T^\dagger \, \mathrm {Prox}\, (T \cdot + b) :{\mathcal {H}_{T}}\rightarrow {\mathcal {H}_{T}}\) is a proximity operator.
Proof
In view of Theorems 2.1 and 2.2, it suffices to show that A is nonexpansive and that there exists a convex function \(\Psi :{\mathcal {H}_{T}}\rightarrow \mathbb {R}\) with \(A = \nabla _{\mathcal {H}_{T}}\Psi \).
1. First, we show that A is firmly nonexpansive, and thus nonexpansive. By (4), we see that
Using this and (5), it follows
and since \(\mathrm {Prox}\) is firmly nonexpansive with respect to \(\Vert \cdot \Vert _{\mathcal {K}}\), see (2), the estimate (10) further implies that A is firmly nonexpansive
2. It remains to prove that there exists a convex function \(\Psi :{\mathcal {H}_{T}}\rightarrow \mathbb {R}\) with \(\nabla _{\mathcal {H}_{T}}\Psi = A\). Since \(\mathrm {Prox}\) is a proximity operator, there exists \(\Phi :\mathcal {K}\rightarrow \mathbb {R}\) with \(\mathrm {Prox}= \nabla _\mathcal {K}\Phi \). Then, a natural candidate is given by \(\Psi =\Phi \, (T \cdot + b)/\Vert T\Vert ^2_{\mathcal {B}(\mathcal {H},\mathcal {K})}\). Using the definition of the gradient and the chain rule, it holds for all \(x,h\in \mathcal {H}\) that
Incorporating Lemma 3.2, we conclude
which implies \(T^*\, T\, \nabla _{\mathcal {H}_{T}}\Psi = T^* \, \mathrm {Prox}\, (Tx + b)\) and \(\nabla _{\mathcal {H}_{T}}\Psi \in \mathcal {R}(T^*)\). By definition of \(T^\dagger \), we obtain \(\nabla _{\mathcal {H}_{T}}\Psi = A\). Finally, \(\Psi \) is convex as it is the concatenation of a convex function with an affine function. \(\square \)
Let
denote the infimal convolution of \(f,g \in \Gamma _0(\mathcal {H})\) and \(x \mapsto \iota _S(x)\) the indicator function of the set S taking the value 0 if \(x \in S\) and \(+\infty \) otherwise.
For \(\mathrm {Prox}:= \mathrm {prox}_{g}\) with \(g \in \Gamma _0(\mathcal {H})\), we are actually able to explicitly compute \(f \in \Gamma _0(\mathcal {H})\) such that \(T^\dagger \, \mathrm {Prox}\, (T \cdot + b) = \mathrm {prox}_f\) on \({\mathcal {H}_{T}}\). Clearly, this also gives an alternative proof for Theorem 3.3.
Theorem 3.4
Let \(b \in \mathcal {K}\), let \(T \in \mathcal {B}(\mathcal {H},\mathcal {K})\) have closed range and let \(\mathrm {Prox}:= \mathrm {prox}_{g}\) for some \(g \in \Gamma _0(\mathcal {K})\). Then, \(T^{\dagger } \, \mathrm {prox}_{g} \, (T \cdot + b) :{\mathcal {H}_{T}}\rightarrow {\mathcal {H}_{T}}\) is the proximity operator on \({\mathcal {H}_{T}}\) of \(f \in \Gamma _0(\mathcal {H})\) given by
This expression simplifies to
Proof
and by (5) further
Hence, we conclude that \(T^{\dagger } \, \mathrm {prox}_{g} \, (T \cdot + b)\) is the proximity operator on \({\mathcal {H}_{T}}\) of f in (11). \(\square \)
Note that for surjective T and \(b=0\), the function f is in general a weaker regularizer than g. This is necessary since for the latter (12) would lead to
4 Frame Soft Shrinkage as Proximity Operator
In this section, we investigate the frame soft shrinkage as a special proximity operator. Let \(\mathcal {K}= \ell _2\) be the Hilbert space of square summable sequences \(c = \{c_k\}_{k \in \mathbb N}\) with norm \(\Vert c \Vert _{\ell _2} := ( \sum _{k \in \mathbb N} |c_k|^2)^{\frac{1}{2}}\) and assume that \(\mathcal {H}\) is separable. A sequence \(\{x_k\}_{k\in \mathbb {N}}\), \(x_k \in \mathcal {H}\), is called a frame of \(\mathcal {H}\), if constants \(0< A \le B < \infty \) exist such that for all \(x \in \mathcal {H}\),
Given a frame \(\{x_k\}_{k\in \mathbb {N}}\) of \(\mathcal {H}\), the corresponding analysis operator \(T :\mathcal {H}\rightarrow \ell _2\) is defined as
Its adjoint \(T^*:\ell _2 \rightarrow \mathcal {H}\) is the synthesis operator given by
By composing T and \(T^*\), we obtain the frame operator
which is invertible on \(\mathcal {H}\), see [11], such that
The sequence \(\{ (T^*T)^{-1}x_k\}_{k\in \mathbb {N}}\) is called the canonical dual frame of \(\{ x_k\}_{k\in \mathbb {N}}\). If
then \(\{x_k\}_{k \in \mathbb N}\) is called a tight frame, and for \(T^*T = \mathrm {Id}_\mathcal {H}\) a Parseval frame. Here, Lemma 3.1 comes into play. Note that \(T^\dagger \) is indeed the synthesis operator for the canonical dual frame of \(\{ x_k\}_{k\in \mathbb {N}}\). The relation between linear, bounded, injective operators with closed range and frame analysis operators is given in the next proposition.
Proposition 4.1
-
(i)
An operator \(T \in \mathcal {B}(\mathcal {H},\ell _2)\) is injective and has closed range if and only if it is the analysis operator of some frame of \(\mathcal {H}\).
-
(ii)
An operator \(T \in \mathcal {B}(\ell _2,\mathcal {H})\) is surjective if and only if it is the synthesis operator of some frame of \(\mathcal {H}\).
Proof
(i) If T is the analysis operator for a frame \(\{x_k\}_{k\in \mathbb {N}}\), then T is bounded, injective and has closed range, see [11]. Conversely, assume that \(T \in \mathcal {B}(\mathcal {H},\ell _2)\) is injective and that \(\mathcal {R}(T)\) is closed. By (4), it holds \(\mathcal {R}(T^*) = \mathcal H\). Let \(\{\delta _k\}_{k\in \mathbb {N}}\) be the canonical basis of \(\ell _2\) and set \(\{x_k \}_{k\in \mathbb {N}}:= \{T^{*} \delta _k\}_{k\in \mathbb {N}}\). Since \(\sum _{k\in \mathbb {N}} |\langle x,x_k \rangle _\mathcal {H}|^2 = \Vert Tx \Vert _{\ell _2}^2\), we conclude that \(\{x_k \}_{k\in \mathbb {N}}\) is a frame of \(\mathcal {H}\) and that T is the corresponding analysis operator.
(ii) Let \(\{x_k\}_{k\in \mathbb {N}} = \{T\delta _k\}_{k\in \mathbb {N}}\). Then, the result follows from [11, Theorem 5.5.1]. \(\square \)
The soft shrinkage operator \(S_\lambda \) on \(\ell _2\) (applied componentwise) is the proximity operator corresponding to the function \(g := \lambda \Vert \cdot \Vert _1\), \(\lambda >0\). As an immediate consequence of Theorem 3.4, we obtain the following corollary.
Corollary 4.2
Assume that \(T:\mathcal {H}\rightarrow \ell _2\) is an analysis operator for some frame of \(\mathcal {H}\) and \(\mathrm {Prox}:\ell _2 \rightarrow \ell _2\) is an arbitrary proximity operator. Then, \(T^{\dagger } \, \mathrm {Prox}\, T \) is itself a proximity operator on \(\mathcal {H}\) equipped with the norm \(\Vert \cdot \Vert _{\mathcal {H}_{T}}\). In particular, if \(\mathrm {Prox}:= S_\lambda \) with \(\lambda >0\), then
Finally, let us have a look at the finite dimensional setting with \(\mathcal {H}:= \mathbb {R}^d\), \(\mathcal {K}:= \mathbb {R}^n\), \(n\ge d\). Then, we have for any \(T \in \mathbb {R}^{n,d}\) with full rank d and the proximity operator \(S_\lambda \) with \(\lambda >0\) on \(\mathbb {R}^n\) that
Example 4.3
We want to compute f for the matrix \(T:\mathbb {R}^{1} \rightarrow \mathbb {R}^{2}\) given by \(T = ( 1 , 2)^\mathrm {T}\) and the soft shrinkage operator \(S_\lambda \) on \(\mathbb {R}^2\) with \(\lambda >0\). Note that this example was also considered in [20]. By (13) and since \(x = (x_1, x_2)^\mathrm {T}\in \mathcal {N}(T^*)\) if and only if \(x_1 = -2 x_2\), we obtain
Consider the strictly convex function \(g_y(x_2) = \lambda \vert y +2x_2 \vert + \lambda \vert 2y-x_2 \vert + \frac{5}{2} x_2^2\). For \(\vert y \vert \le \frac{2}{5} \lambda \), it holds
Hence, by Fermat’s theorem, the unique minimizer of \(g_y(x_2)\) is given by \(-\frac{y}{2}\). Consequently, we have for \(\vert y \vert \le \frac{2}{5}\lambda \) that
For \(\vert y \vert > \frac{2\lambda }{5}\), the function \(g_y\) is differentiable in \(-\frac{\lambda }{5} \mathrm {sgn}(y)\) and it holds
Therefore, for \(\vert y \vert > \frac{2\lambda }{5}\), the minimizer of \(g_y\) is \(-\frac{\lambda }{5} \mathrm {sgn}(y)\) and
Choosing, e.g., \(\lambda = \frac{1}{3}\) we obtain
which is a good approximation of \(\tfrac{1}{5} \vert y \vert \).
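The operator of this example can also be evaluated numerically. The sketch below checks two values of \(T^\dagger S_\lambda T\) directly (for \(|y|\) below the shrinkage thresholds, and for \(|y| \ge \lambda \), where both components of \(Ty\) are active and a direct computation gives \(y - 3\lambda /5\)), and samples monotonicity and 1-Lipschitz continuity, which a one-dimensional proximity operator must satisfy:

```python
import numpy as np

def soft_shrink(x, lam):
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

lam = 1.0 / 3.0
T = np.array([[1.0], [2.0]])
T_dag = np.linalg.pinv(T)                 # = (1/5) * (1, 2)

def A(y):
    # frame soft shrinkage T^dagger S_lambda T acting on R
    return (T_dag @ soft_shrink(T * y, lam)).item()

assert A(0.1) == 0.0                      # both |y| and |2y| are <= lam
assert abs(A(2.0) - (2.0 - 3.0 * lam / 5.0)) < 1e-12

# in 1D, a proximity operator is monotone and here 1-Lipschitz
grid = np.linspace(-3.0, 3.0, 601)
vals = np.array([A(y) for y in grid])
assert np.all(np.diff(vals) >= -1e-12)
assert np.all(np.diff(vals) <= np.diff(grid) + 1e-12)
```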
5 Proximal Neural Networks
In this section, we consider neural networks (NNs) consisting of \(K \in {\mathbb N}\) layers with dimensions \(n_{1}, \ldots , n_{K}\) defined by mappings \(\Phi = \Phi (\cdot \,; u):{\mathbb R}^{d} \rightarrow {\mathbb R}^{n_{K}}\) of the form
Such NNs are composed of affine functions \(A_{k}:{\mathbb R}^{n_{k-1}} \rightarrow {\mathbb R}^{n_{k}}\) given by
with weight matrices \(L_{k} \in {\mathbb R}^{n_{k}, n_{k-1}}\), \(n_{0}=d\), bias vectors \(b_{k} \in {\mathbb R}^{n_{k}}\) as well as a non-linear activation \(\sigma :\mathbb {R} \rightarrow \mathbb {R}\) acting at each component, i.e., for \({x}= (x_{j})_{j=1}^{n}\) we have \(\sigma (x) = (\sigma (x_{j}))_{j=1}^{n}\). The parameter set \(u:=\left( L_k,b_k\right) _{k=1}^{K}\) of such a NN has the overall dimension \(D:= n_0 n_1 + n_1 n_2 + \dots + n_{K-1}n_K + n_1 + \dots + n_K\). For an illustration see Fig. 1.
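The parameter count \(D\) can be written down directly. A tiny helper, with the layer dimensions passed as an assumed tuple \((n_0, n_1, \ldots , n_K)\), \(n_0 = d\):

```python
def pnn_param_count(dims):
    # dims = (n_0, n_1, ..., n_K) with n_0 = d
    weights = sum(a * b for a, b in zip(dims[:-1], dims[1:]))  # n_{k-1} n_k terms
    biases = sum(dims[1:])                                     # n_k terms
    return weights + biases

# D = n_0 n_1 + n_1 n_2 + n_1 + n_2 for K = 2 layers
assert pnn_param_count((3, 5, 2)) == 3 * 5 + 5 * 2 + 5 + 2     # = 32
```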
In [13], the notion of stable activation functions was introduced. An activation function \(\sigma :\mathbb R \rightarrow \mathbb R\) is called stable if it is monotone increasing, 1-Lipschitz continuous and satisfies \(\sigma (0) = 0\). The following result was shown in [13].
Lemma 5.1
A function \(\sigma :\mathbb {R} \rightarrow \mathbb {R}\) is a stable activation function if and only if there exists \(g \in \Gamma _0(\mathbb R)\) having 0 as a minimizer such that \(\sigma = \mathrm {prox}_{g}\).
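The three defining properties of a stable activation function are easy to test numerically on a grid. The sketch below checks them for tanh, ReLU and soft shrinkage; the grid test is of course only a necessary-condition check, not a proof:

```python
import numpy as np

def is_stable(sigma, grid):
    vals = sigma(grid)
    mono = np.all(np.diff(vals) >= -1e-12)                          # monotone increasing
    lip = np.all(np.abs(np.diff(vals)) <= np.diff(grid) + 1e-12)    # 1-Lipschitz
    zero = abs(sigma(np.array([0.0]))[0]) < 1e-12                   # sigma(0) = 0
    return bool(mono and lip and zero)

grid = np.linspace(-5.0, 5.0, 10001)
relu = lambda x: np.maximum(x, 0.0)
soft = lambda x: np.sign(x) * np.maximum(np.abs(x) - 0.5, 0.0)

assert is_stable(np.tanh, grid)
assert is_stable(relu, grid)
assert is_stable(soft, grid)
assert not is_stable(lambda x: 2.0 * x, grid)   # violates 1-Lipschitz continuity
```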
Various common activation functions \(\sigma \) and corresponding functions \(g \in \Gamma _0(\mathbb R)\) are listed in Table 3 in the appendix. For \(T_k \in \mathbb R^{n_k,d}\), we consider the norm (8) and denote it by
In the previous sections, we have considered two different kinds of proximity operators, namely \(\mathrm {prox}_g\) with respect to the Euclidean norm
and \(\mathrm {prox}_{T_k,g}\) with respect to the norm (16)
Further, we derived a function \(f_k\) depending on g, \(T_k\) and \(b_k\), see Theorem 3.4, such that
Based on our observations in the previous sections, we consider the following special NNs. We choose a stable activation function \(\sigma = \mathrm {prox}_g\) for some \(g \in \Gamma _0(\mathbb R)\) and matrices \(T_{k} \in {\mathbb R}^{n_{k} , d}\), as well as bias vectors \(b_{k} \in {\mathbb R}^{n_{k}}\), \(k=1, \ldots , K\), and construct according to (15) the affine mappings
Then, the NN \(\Phi :\mathbb {R}^d\rightarrow \mathbb {R}^{n_K}\) in (14) with \(A_{k}\), \(b_{k}\) in (18) can be rewritten as
We call \(\Phi \) a proximal neural network (PNN) with network parameters \(u := (T_k,b_k)_{k=1}^{K}\).
Next, we investigate stability properties of such networks. Recall that an operator \(\Psi :\mathcal {H}\rightarrow \mathcal {H}\) on a Hilbert space \(\mathcal {H}\) is \(\alpha \)-averaged, \(\alpha \in (0,1)\), if there exists a nonexpansive operator \(R:\mathcal {H}\rightarrow \mathcal {H}\) such that
The following theorem summarizes properties of \(\alpha \)-averaged operators, cf. [5] and [38] for the third statement.
Theorem 5.2
Let \(\mathcal {H}\) be a separable real Hilbert space. Then the following holds true:
-
(i)
An operator on \(\mathcal {H}\) is firmly nonexpansive if and only if it is \(\frac{1}{2}\)-averaged.
-
(ii)
The concatenation of K operators which are \(\alpha _k\)-averaged with respect to the same norm is \(\alpha \)-averaged with \( \alpha = \frac{K}{K-1 + 1/\max _k \alpha _k} \).
-
(iii)
For an \(\alpha \)-averaged operator \(\Psi :\mathcal {H}\rightarrow \mathcal {H}\) with a nonempty fixed point set, the sequence generated by the iteration
$$\begin{aligned} x^{(r+1)} = \Psi \bigl (x^{(r)} \bigr ) \end{aligned}$$converges weakly for every starting point \(x^{(0)} \in \mathcal {H}\) to a fixed point of \(\Psi \).
In the following, we study special PNNs, which are \(\alpha \)-averaged operators such that \(x^{(r+1)} = \Phi (x^{(r)};u)\) converges to a fixed point of \(\Phi \) if such a point exists.
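Theorem 5.2(iii) can be illustrated with a toy averaged operator. Here \(R\) is a plane rotation, which is nonexpansive with unique fixed point \(0\), and \(\Psi = (1-\alpha )\,\mathrm {Id} + \alpha R\) with \(\alpha = 1/2\):

```python
import numpy as np

theta = np.pi / 3
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])     # rotation: nonexpansive, Fix R = {0}
alpha = 0.5
Psi = lambda x: (1.0 - alpha) * x + alpha * (R @ x)  # alpha-averaged operator

x = np.array([3.0, -4.0])
for _ in range(200):
    x = Psi(x)

# the iteration converges to the fixed point 0
assert np.linalg.norm(x) < 1e-10
```

Note that \(R\) itself is not a contraction, yet the averaged iteration still converges, which is exactly the point of the theorem.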
Lemma 5.3
-
(i)
Let \(T_k\in \mathbb R^{n_k,d}\) fulfill \(T_k^* T_k = \Vert T_k \Vert ^2 I_{d}\) or \(T_k T_k^* = \Vert T_k \Vert ^2 I_{n_{k}}\) for all \(k=1,\ldots , K-1\) and let \(T_K = I_d\). Then \(\Phi \) in (19) is \(\alpha \)-averaged with \(\alpha = \frac{K-1}{K}\).
-
(ii)
Let \(T_1 \in \mathbb R^{n_1,d}\) with full column rank fulfill \(\Vert T_1\Vert ^2 T_k^* T_k =\Vert T_k\Vert ^2 T_1^* T_1\) for \(k=1,\ldots ,K-1\) and \(T_K = I_d\). Then \(\Phi \) in (19) is \(\alpha \)-averaged with \(\alpha = \frac{K-1}{K}\).
Proof
(i) By Lemma 3.1, we know that \(\Vert \cdot \Vert _{T_k} = \Vert \cdot \Vert _2\) so that \(\Phi \) is the concatenation of \(K-1\) proximity operators on \(\mathbb R^d\) with respect to the Euclidean norm. More precisely,
with \(f_k\) as in Theorem 3.4. Now, the assertion follows from Theorem 5.2.
(ii) By assumption, we obtain \(\Vert x\Vert _{T_k}^2 = x^* T_k^* T_k x /\Vert T_k\Vert ^2 = x^* T_1^* T_1 x/\Vert T_1\Vert ^2 = \Vert x\Vert _{T_1}^2\). Hence, \(\Phi \) becomes the concatenation of \(K-1\) proximity operators on \(\mathbb R^d\), all with respect to the norm \(\Vert \cdot \Vert _{T_1}\). Again, the assertion follows from Theorem 5.2. \(\square \)
Remark 5.4
Lemma 5.3(i) can be generalized to the case where \(T_K\in \mathbb {R}^{d,d}\) is a symmetric positive semi-definite matrix with norm not larger than 1. In this case, \(T_K\) can be written in the form \(T_K=Q^*Q\) for some \(Q\in \mathbb {R}^{d,d}\) with \(\Vert Q\Vert ^2=\Vert Q^*Q\Vert =\Vert T_K\Vert \le 1\). Thus, for every \(x,y\in \mathbb {R}^d\),
This shows that \(T_K\cdot +b_K\) is firmly nonexpansive and therefore \(\frac{1}{2}\)-averaged. Consequently, \(\Phi \) in (19) is the concatenation of K \(\frac{1}{2}\)-averaged operators with respect to the Euclidean norm. Hence, \(\Phi \) is itself \(\alpha \)-averaged with \(\alpha = \frac{K}{K+1}\).
Remark 5.5
In [13], the following NN structure was studied: Let \(\mathcal {H}_0, \ldots ,\mathcal {H}_K\) be a sequence of real Hilbert spaces and \(\mathcal {H}_0 = \mathcal {H}_K = \mathcal {H}\). Further, let \(W_k \in {\mathcal B} (\mathcal {H}_{k-1},\mathcal {H}_k)\) and \(P_k:\mathcal {H}_k \rightarrow \mathcal {H}_k\), \(k=1,\ldots ,K\) be firmly nonexpansive operators. For this case, Combettes and Pesquet [13] have posed conditions on \(W_k\) such that
is \(\alpha \)-averaged for some \(\alpha \in (1/2,1)\). For \(\mathcal {H}= \mathbb R^d\) equipped with the Euclidean norm, \(\mathcal {H}_k = \mathbb R^d\) equipped with the norm (16) and \(T_K = I_d\), our PNN \(\Phi \) has exactly the form (20) with \(P_k := \mathrm {prox}_{T_k,f_k}:\mathcal {H}_k \rightarrow \mathcal {H}_k\) and the embedding operators \(W_k:\mathcal {H}_{k-1} \hookrightarrow \mathcal {H}_k\), \(k=1,\ldots ,K\). For the special PNNs in Lemma 5.3 it holds \(W_k = I_d\), such that the conditions in [13] are fulfilled.
In the rest of this paper, we restrict our attention to matrices \(T_k:\mathbb R^{d} \rightarrow \mathbb R^{n_k}\) fulfilling
and \(T_{K} = I_{d}\), i.e., the rows, resp. columns of \(T_k\) form a Parseval frame. Then, the PNN in (19) has the form
with the “usual” proximity operator, cf. (17), and
Due to the use of Parseval frames, we call these networks Parseval (frame) proximal neural networks (PPNNs). By our previous considerations, see Lemma 5.3, PPNNs are averaged operators.
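A minimal sketch of one admissible PPNN, assuming hidden layers of the form \(x \mapsto T_k^* \, \sigma (T_k x + b_k)\) with \(T_k^* T_k = I_d\) (so that \(T_k^\dagger = T_k^*\), cf. Corollary 4.2) and a final bias shift corresponding to \(T_K = I_d\); the exact display (19) is not reproduced above. The sampled Lipschitz quotients illustrate that the network is nonexpansive:

```python
import numpy as np

rng = np.random.default_rng(3)

def soft_shrink(x, lam=0.2):
    # stable activation function (proximity operator of lam*|.|)
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

d, hidden = 4, [8, 6, 4]
# Stiefel matrices: orthonormal columns, T_k^* T_k = I_d
Ts = [np.linalg.qr(rng.standard_normal((n, d)))[0] for n in hidden]
bs = [rng.standard_normal(n) for n in hidden]
b_K = rng.standard_normal(d)

def ppnn(x):
    # each layer T^* sigma(T x + b) is a proximity operator on R^d
    for T, b in zip(Ts, bs):
        x = T.T @ soft_shrink(T @ x + b)
    return x + b_K          # final layer with T_K = I_d

# averaged, hence nonexpansive: sample the Lipschitz quotient
for _ in range(200):
    u, v = rng.standard_normal(d), rng.standard_normal(d)
    assert np.linalg.norm(ppnn(u) - ppnn(v)) <= np.linalg.norm(u - v) + 1e-12
```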
Remark 5.6
An interesting result follows from convergence considerations of the cyclic proximal point algorithm, see [7]. Let \(\{\lambda _r\}_{r \in \mathbb N} \in \ell _2 \setminus \ell _1\). Then, for every \(x^{(0)} \in \mathbb R^d\), the sequence generated by
converges to a minimizer of \(f_1 + \cdots + f_{K-1}\). In particular, Theorem 3.4 implies that for orthogonal matrices \(T_k\), \(k=1,\ldots ,K-1\) and \(T_K = I_d\), \(b_K = 0\), the sequence \(\{x^{(r)}\}_{r \in \mathbb N}\) in (21) converges to
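Since the iteration (21) is not reproduced here, the following one-dimensional sketch (with functions \(f_k(x) = |x - c_k|\) of our choosing) illustrates the convergence statement: the sum of the \(f_k\) is minimized by the median of the \(c_k\), and the step sizes \(\lambda _r = 1/r\) form a sequence in \(\ell _2 \setminus \ell _1\):

```python
import numpy as np

def prox_abs_shift(x, c, t):
    """prox_{t |. - c|}(x): soft shrinkage of x - c by t, shifted back to c."""
    return c + np.sign(x - c) * np.maximum(np.abs(x - c) - t, 0.0)

# Cyclic proximal point for f_1 + f_2 + f_3 with f_k(x) = |x - c_k|;
# the sum is minimized by the median of the c_k, here 1.0.
c = np.array([0.0, 1.0, 5.0])
x = 10.0
for r in range(1, 20001):
    lam = 1.0 / r                  # {1/r} is in l_2 but not in l_1
    for ck in c:
        x = prox_abs_shift(x, ck, lam)

assert abs(x - 1.0) < 0.05         # close to the minimizer (the median)
```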
6 Training PPNNs on Stiefel Manifolds
In this section, we show how to train PPNNs.
Remark 6.1
According to Lemma 5.3(i), we could add more flexibility to our model by allowing tight frames instead. Then, we must train an additional scaling constant, which does not introduce difficulties in the training process and may be useful for special applications, see [2]. In our numerical experiments, we omitted the additional scaling constant as we do not want to focus on this particular issue.
In PPNNs, we assume that either \(T_k\) or \(T_k^*\), \(k=1,\ldots ,K-1\), is an element of the Stiefel manifold
The following facts on Stiefel manifolds can be found, e.g., in [1]. For \(d\le n\), the (compact) Stiefel manifold is defined as
For \(d=1\), this reduces to the sphere \(\mathbb S^{n-1}\), and for \(d=n\) we obtain the orthogonal group \(\mathrm {O}(n)\). In general, \(\mathrm {St}(d,n)\) is a manifold of dimension \(nd -\frac{1}{2} d(d+1)\) with tangent space at \(T \in \mathrm {St}(d,n)\) given by
where the columns of \(T_{\perp }\in \mathbb {R}^{n,n-d}\) form an orthonormal basis of the orthogonal complement of the range of T, i.e., \(T^*_{\perp } T_{\perp } = I_{n-d}\) and \(T^*T_{\perp }=0\). The Riemannian gradient of a function on \(\mathrm {St}(d,n)\) can be obtained by the orthogonal projection of its Euclidean gradient in \(\mathbb R^{n,d}\) onto the tangent space \({\mathcal T}_T \mathrm {St}(d,n)\). The orthogonal projection of \(X \in \mathbb {R}^{n,d}\) onto \({\mathcal T}_T \mathrm {St}(d,n)\) is given by
To emphasize that for fixed T the matrix W depends on X, we will also write \(W_X\). A retraction \(\mathcal {R}\) on the manifold \(\mathrm {St}(d,n)\) is a smooth mapping from the tangent bundle of \(\mathrm {St}(d,n)\) to the manifold fulfilling \({\mathcal R}_{T}(0) = T\), where 0 is the zero element in \({\mathcal T}_T \mathrm {St}(d,n)\), and with the identification \({\mathcal T}_0({\mathcal T}_{T} \mathrm {St}(d,n)) \cong {\mathcal T}_T \mathrm {St}(d,n)\) the local rigidity condition \( D {\mathcal R}_{T} (0) = \mathrm {Id}_{ {\mathcal T}_T \mathrm {St}(d,n)} \) holds true. A well-known retraction on \(\mathrm {St}(d,n)\) is
where \(\mathrm {qf}(A)\) denotes the Q factor of the QR decomposition of a matrix \(A\in \mathbb {R}^{n,d}\) with linearly independent columns as \(A=QR\), where \(Q\in \mathrm {St}(d,n)\) and R is an upper triangular \(d \times d\) matrix with strictly positive diagonal elements. The complexity of the QR decomposition using the Householder algorithm is \(2d^2(n-d/3)\), see [21]. Since the computation of the QR decomposition appears to be time-consuming on a GPU, we prefer to apply another retraction, based on the Cayley transform of the skew-symmetric matrices W in (23), namely
see [36, 48]. By straightforward computation it can be seen that \(W_X\) and \(W_{P_T X}\) coincide, so that the retraction (25) enlarged to the whole \(\mathbb R^{n,d}\) fulfills
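For concreteness, the QR retraction (24) can be sketched as follows (a minimal numpy illustration; \(\mathcal {R}_T(X) = \mathrm {qf}(T+X)\) is the standard QR retraction from [1] and is our reading of (24)):

```python
import numpy as np

def qf(A):
    """Q factor of the thin QR decomposition A = QR, with the sign convention
    that R has strictly positive diagonal entries."""
    Q, R = np.linalg.qr(A)
    s = np.sign(np.diag(R))
    s[s == 0] = 1.0                 # guard; cannot occur for full-rank A
    return Q * s                    # flip columns so that diag(R) > 0

def retraction_qr(T, X):
    """QR retraction R_T(X) = qf(T + X) on St(d, n)."""
    return qf(T + X)

rng = np.random.default_rng(1)
n, d = 8, 3
T = qf(rng.standard_normal((n, d)))                        # a point on St(d, n)
assert np.allclose(T.T @ T, np.eye(d))                     # T^* T = I_d
assert np.allclose(retraction_qr(T, np.zeros((n, d))), T)  # R_T(0) = T
```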
Remark 6.2
The retraction (25) has the drawback that it contains a matrix inversion. In our numerical algorithm, the following simple fixed point iteration is used for computing the matrix \(R = {\mathcal R}_T(X)\) with fixed T and X. By definition, R fulfills the fixed point equation
Starting with an arbitrary \(R^{(0)} \in \mathrm {St}(d,n)\), we apply the iteration
which converges by Banach’s fixed point theorem to the fixed point of (27) if \(\frac{1}{2} \rho (W) < 1\), where \(\rho (W)\) denotes the spectral radius of W.
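A minimal numerical sketch of this scheme (our reading of (25) and (27), namely \(\mathcal {R}_T(X) = (I - \tfrac{W}{2})^{-1}(I + \tfrac{W}{2})T\) and the fixed point equation \(R = T + \tfrac{W}{2}(T + R)\); the test data are random):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 6, 2

# A point T on St(d, n) and a skew-symmetric matrix W as in (23).
T, _ = np.linalg.qr(rng.standard_normal((n, d)))
B = 0.1 * rng.standard_normal((n, n))
W = B - B.T

# Closed form of the Cayley retraction: R = (I - W/2)^{-1} (I + W/2) T.
I = np.eye(n)
R_exact = np.linalg.solve(I - W / 2, (I + W / 2) @ T)

# Fixed point iteration avoiding the matrix inversion: R = T + (W/2)(T + R)
# is a contraction in R whenever (1/2) rho(W) < 1.
R = T.copy()
for _ in range(50):
    R = T + (W / 2) @ (T + R)

assert np.allclose(R, R_exact)
assert np.allclose(R.T @ R, np.eye(d))   # R stays on the Stiefel manifold
```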
We want to train a PPNN by minimizing
where \({\ell }:\mathbb R^d \times \mathbb R^d \rightarrow \mathbb R\) is a loss function that is differentiable with respect to the first d variables.
Example 6.3
Let us specify two special cases of PPNNs with one layer.
-
(i)
For one layer without bias and componentwise soft shrinkage \(\sigma \) as activation function, i.e., summands
$$\begin{aligned} \sum _{i=1}^N {\ell } \bigl (T_1^* \sigma (T_1 x_i);y_i \bigr ), \quad T_1 \in \mathrm {St}(d,n_1), \end{aligned}$$we learn Parseval frames, e.g., for denoising tasks with \(x_i\) as a noisy version of \(y_i\). Here, we want to mention the significant amount of work on dictionary learning, see [18], which starts from the same goal.
-
(ii)
For \(x_i = y_i\), \(i=1,\ldots ,N\), the above network can be used as a so-called auto-encoder. Again for one layer, now without activation function, with \(b_1 = 0\) and \({\ell }(x;y) = h(\Vert x-y\Vert )\) for some norm \(\Vert \cdot \Vert \) on \(\mathbb R^d\), we get
$$\begin{aligned} \sum _{i=1}^N {\ell } \left( T_1 ^* T_1 x_i ;x_i \right) = \sum _{i=1}^N h\bigl ( \Vert (I_d - T_1 ^* T_1) x_i \Vert \bigr ), \quad T_1^* \in \mathrm {St}(d,n_1). \end{aligned}$$For the Euclidean norm and \(h(x) = x^2\) we get the classical PCA approach and for \(h(x) = x\) the robust rotationally invariant \(L_1\)-norm PCA, recently discussed in [30, 35].
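The PCA connection can be illustrated as follows (a synthetic-data sketch with \(h(x)=x^2\) and the Euclidean norm; for the optimal T, the rows are the leading right singular vectors of the data matrix):

```python
import numpy as np

rng = np.random.default_rng(3)
N, d, r = 200, 10, 3

# Data with an approximate r-dimensional structure.
X = rng.standard_normal((N, r)) @ rng.standard_normal((r, d))
X += 0.01 * rng.standard_normal((N, d))

# Classical PCA: T with rows = top r right singular vectors minimizes
# sum_i ||(I - T^* T) x_i||^2 over all T with T T^* = I_r.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
T = Vt[:r]                       # r x d with orthonormal rows
P = T.T @ T                      # orthogonal projection onto the PCA subspace

err_pca = np.sum((X - X @ P) ** 2)

# Any other matrix with orthonormal rows gives a larger (or equal) error.
T_rand, _ = np.linalg.qr(rng.standard_normal((d, r)))   # d x r, orthonormal cols
P_rand = T_rand @ T_rand.T
assert err_pca <= np.sum((X - X @ P_rand) ** 2) + 1e-9
```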
The following remark points out that special cases of our PPNNs were already considered in the literature.
Remark 6.4
In [26], NNs with weight matrices \(L_k \in \mathbb R^{n_k,n_{k-1}}\), \(k \in \{1,\ldots ,K-1\}\), (or their transposes) lying in a Stiefel manifold were examined. The authors called this approach optimization over multiple dependent Stiefel manifolds (OMDSM). Indeed, for the following reasons, these NNs are special cases of our PPNNs if \(n_k \le d\) for all \(k=1,\ldots ,K-1\). In particular, this implies that the NNs considered in [26] (with appropriately chosen last layer) are averaged operators.
(i) Case \(n_{k} \le n_{k-1}\): Let \(L_k^* \in \mathrm {St}(n_k, n_{k-1})\), i.e., \(L_k L_k^* = I_{n_k}\). Choosing an arbitrary fixed \(T_{k-1} \in \mathbb R^{n_{k-1},d}\) with \(T_{k-1} T_{k-1}^* = I_{n_{k-1}}\), we want to find \(T_{k} \in \mathbb R^{n_{k},d}\) such that
It is straightforward to verify that \(T_k := L_k T_{k-1}\) has the desired properties.
Note that if the transposes of \(T_k\) and \(T_{k-1}\) are in a Stiefel manifold, this does not necessarily hold for the transpose of \(T_k T_{k-1}^*\). Therefore, our PPNNs are more general.
(ii) Case \(n_{k-1}< n_k\): Let \(L_k \in \mathrm {St}(n_{k-1},n_k)\), i.e., \(L_k^* L_k = I_{n_{k-1}}\). For an arbitrary fixed \(T_{k-1} \in \mathbb R^{n_{k-1},d}\) with \(T_{k-1} T_{k-1}^* = I_{n_{k-1}}\), we want to find \(T_{k} \in \mathbb R^{n_{k},d}\) fulfilling (29). To this end, we complete \(L_k\) to an orthogonal matrix \(\tilde{L}_k \in \mathbb {R}^{n_k,n_k}\) and \(T_{k-1}\) to a matrix \(\tilde{T}_{k-1}\in \mathbb {R}^{n_k,d}\) with orthogonal rows. By straightforward computation we verify that \(T_k := \tilde{L}_k \tilde{T}_{k-1}\) satisfies (29) such that this case also fits into our PPNN framework.
We apply a stochastic gradient descent algorithm on the Stiefel manifold to find a minimizer of (28). To this end, we compute the Euclidean gradient with respect to one layer and apply the usual backpropagation for multiple layers.
Lemma 6.5
Let
where T or \(T^*\) are in \(\mathrm {St}(d,n)\), and \(\ell \) and \(\sigma \) are differentiable. Set
where the gradient of \(\ell \) is taken with respect to the first d variables. Then it holds for the Euclidean gradient
The proof follows by straightforward computations that are carried out in the Appendix.
Now, we can formulate stochastic gradient descent (SGD) for \(\mathcal J\) as in Algorithm 1. This algorithm works for an arbitrary retraction, in particular for the retraction in (24). In our numerical computations, we use the special retraction (25) in connection with the iteration scheme from Remark 6.2. Then, by (26), the projection in step 3 of the algorithm can be skipped and the retraction can be applied directly to the Euclidean gradient.
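The core update of such a stochastic gradient step on the Stiefel manifold can be sketched as follows. This is a minimal numpy illustration under assumptions: the skew-symmetric direction \(W = G T^* - T G^*\) is the form used by Wen and Yin [48] and may differ in detail from (23), the QR retraction is used instead of (25) for brevity, and the toy objective is ours:

```python
import numpy as np

rng = np.random.default_rng(4)

def qf(A):
    """Q factor of the QR decomposition with positive diagonal R."""
    Q, R = np.linalg.qr(A)
    s = np.sign(np.diag(R))
    s[s == 0] = 1.0
    return Q * s

def sgd_step_stiefel(T, G, lr):
    """One gradient step on St(d, n): skew-symmetric direction in the style
    of Wen and Yin [48], followed by a retraction (here the QR retraction)."""
    W = G @ T.T - T @ G.T            # skew-symmetric matrix built from the
    return qf(T - lr * (W @ T))      # Euclidean gradient G, then retract

# Toy objective J(T) = 0.5 ||T x - y||^2 with Euclidean gradient (T x - y) x^T.
n, d = 6, 4
x, y = rng.standard_normal(d), rng.standard_normal(n)
T = qf(rng.standard_normal((n, d)))
for _ in range(200):
    G = np.outer(T @ x - y, x)
    T = sgd_step_stiefel(T, G, lr=0.05)

assert np.allclose(T.T @ T, np.eye(d), atol=1e-8)   # T stays on St(d, n)
```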
7 Some Numerical Results
In this section, we present simple numerical results to get a first impression of the performance of PPNNs for denoising and classification. More sophisticated examples, which include the full repertoire of fine-tuning of NNs, will follow in an experimental paper; see also our conclusions. Throughout this section, we use the quadratic loss function \(\ell (x;y) := \Vert x-y\Vert _2^2\). For training we apply a stochastic gradient descent algorithm. We initialize the matrices \(T_k\in \mathrm {St}(d,n)\) randomly using the orthogonal initializer from TensorFlow. That is, we generate a matrix \(\tilde{T}_k\in \mathbb {R}^{n,d}\) with independent random entries following the standard normal distribution and use the initialization \(T_k=\mathrm {qf}(\tilde{T}_k)\). The batch size and learning rate are given separately for each example.
Denoising In this experiment, we compare PPNNs with Haar wavelet thresholding both for discrete Haar bases and Haar frames arising from the undecimated (translation invariant) version of the Haar transform. In particular, the experiment is linked to the starting point of our considerations, namely wavelet and frame shrinkage. For further details on the corresponding filter banks we refer to [15, 37]. As quality measure for our experiments we choose the average peak signal to noise ratio (PSNR) over the test set. Recall that for a prediction \(x\in \mathbb {R}^{n}\) and ground truth \(y\in \mathbb {R}^{n}\) the PSNR is defined by
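As a concrete sketch of this quality measure (using the standard definition \(\mathrm {PSNR} = 10\log _{10}(\mathrm {max}^2/\mathrm {MSE})\); the default choice of the peak value below is our assumption, not taken from the experiments):

```python
import numpy as np

def psnr(x, y, max_val=None):
    """PSNR of a prediction x against the ground truth y,
    10 log10(max_val^2 / MSE); by default the peak of the ground truth."""
    if max_val is None:
        max_val = np.abs(y).max()
    mse = np.mean((x - y) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)

y = np.ones(128)
x = y + 0.1 * np.random.default_rng(5).standard_normal(128)
assert psnr(x, y) > psnr(y + 0.5, y)   # smaller distortion, larger PSNR
```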
Since we focus on the Haar filter, we restrict our attention to piecewise constant signals with mean 0. By \((x_i,y_i)\in \mathbb {R}^{d}\times \mathbb {R}^{d}\), \(i=1,\ldots ,N\), we denote pairs consisting of piecewise constant signals \(y_i\) of length \(d=2^7=128\) and their noisy versions \(x_i=y_i+\epsilon _i\), where \(\epsilon _i\) is white noise with standard deviation \(\sigma = 0.1\). For the signal generation, we choose
-
The number of constant parts of \(y_i\) as \(\max \{2,t_i\}\), where \(t_i\) is the realization of a random variable following the Poisson distribution with mean 5;
-
The discontinuities of \(y_i\) as realizations of a uniform distribution;
-
The signal intensity of \(y_i\) for every constant part as realization of the standard normal distribution, where we finally subtract the mean of the signal.
Using this procedure, we generate training data \((x_i,y_i)_{i=1}^N\) and test data \((x_i,y_i)_{i=N+1}^{N+N_\text {test}}\) with \(N=500{,}000\) and \(N_\text {test}=1000\). The average PSNR of the noisy signals in the test set is 25.22. We use PPNNs with \(K-1 \in \{1, 2, 3\}\) layers and set \(T_K=I_d\) and \(b_K=0\). In all examples, a batch size of 32 and a learning rate of 0.5 are used.
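The generation procedure described above can be sketched as follows (a minimal numpy implementation of the three bullet points; variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(6)
d, sigma = 128, 0.1

def sample_pair():
    """One data pair (x, y): a zero-mean piecewise constant signal y of
    length d and its noisy version x = y + eps, eps ~ N(0, sigma^2 I)."""
    parts = max(2, rng.poisson(5))                   # number of constant parts
    cuts = np.sort(rng.choice(np.arange(1, d), size=parts - 1, replace=False))
    y = np.empty(d)
    for lo, hi in zip(np.r_[0, cuts], np.r_[cuts, d]):
        y[lo:hi] = rng.standard_normal()             # intensity of each part
    y -= y.mean()                                    # subtract the mean
    return y + sigma * rng.standard_normal(d), y

x, y = sample_pair()
assert abs(y.mean()) < 1e-10 and x.shape == (d,)
```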
We are interested in two different settings:
1. Learned orthogonal matrices versus Haar basis. First, we consider PPNNs with 128 neurons in each hidden layer and componentwise soft-shrinkage \(S_\lambda \) as activation function. In particular, all matrices \(T_k\) have to be orthogonal. The denoising results of our learned PPNN are compared with the soft wavelet shrinkage with respect to the discrete orthogonal Haar basis in \(\mathbb R^{128}\), i.e., the signal on all 6 scales is decomposed by
where \( H := H_2 \, \cdots \, H_7 \) with matrices
The average PSNRs on the test data are given in Table 1. For determining the optimal threshold in \(S_\lambda \), we implemented two different methods. The first one is 5-fold cross validation (CV): the training data is divided into 5 subsets, each of which is used once as a test set with the remaining samples as training set, and the test loss for a given \(\lambda \) is averaged over all 5 trials to judge the quality of the model. The tested parameters \(\lambda \) are chosen in [0.05, 0.3] with steps of 0.05 for a NN with one layer; for NNs with two and three layers, the tested parameters are divided by 2 and 3, respectively. The second method is to set \(\lambda \) as a trainable variable of the neural network and to optimize it via stochastic gradient descent (SGD) during the training process. It turns out that for only one hidden layer the Haar wavelet shrinkage is still better than the learned orthogonal matrix. If we increase the number of layers, then PPNNs lead to a better average PSNR.
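The basic building block of this comparison, componentwise soft shrinkage combined with an orthogonal analysis matrix, can be sketched as follows (a minimal illustration; \(H = I_3\) serves only as a placeholder for the actual Haar matrix H):

```python
import numpy as np

def soft_shrink(x, lam):
    """Componentwise soft shrinkage S_lambda(x) = sign(x) max(|x| - lambda, 0)."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def wavelet_denoise(x, H, lam):
    """Orthogonal soft wavelet shrinkage H^* S_lambda(H x) for an orthogonal
    analysis matrix H (e.g. the discrete Haar basis)."""
    return H.T @ soft_shrink(H @ x, lam)

# Sanity check with H = I: thresholding then acts directly on the signal.
x = np.array([0.3, -0.05, 1.0])
assert np.allclose(wavelet_denoise(x, np.eye(3), 0.1), [0.2, 0.0, 0.9])
```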
Two exemplary noisy signals and their denoised versions are shown in Fig. 2. Since we have learned the orthogonal matrices \(T_k\) with respect to the quadratic loss function, the visual quality of the PPNN denoised signals is clearly not satisfactory even with an improved PSNR. The visual impression of signals denoised by Haar wavelet shrinkage can be improved (smoother signal) by increasing the threshold to \(\lambda = 0.3\), at the cost of a worse PSNR. To achieve a similar behavior with orthogonal matrices learned by PPNNs, we would have to choose a different loss function.
2. Learned Stiefel matrices versus Haar frame. Haar wavelet shrinkage can be improved by using Haar wavelet frames within a so-called “algorithme à trous”, see [32]. We apply a similar method as in (30), but with a rectangular matrix H whose rows form a Haar frame. More precisely, the Haar filter is used without subsampling. This results, in contrast to the original Haar transform, in a translation invariant multiscale transform. Instead of the matrices \(H_7\), the nonsubsampled (convolution) matrix
is applied to the original signal. Note that we assume a periodic continuation of the signal. In each of the following steps \(j=1, \ldots ,6\), we keep the lower part and transform the upper smoothed part again by essentially the same matrix, where \(2^{j}-1\) zeros are inserted between the filter coefficients. The output signal has size \(8\cdot 128\), where the last part of the signal is just the averaged signal, which is equal to zero. Overall, the original signal is multiplied by \(H\in \mathbb {R}^{1024, 128}\), where for \(j\in \{0,\ldots ,6\}\) and \(i\in \{0,\ldots ,128-2j\}\) the \((i+128 j)\)-th row is given by one element of a Haar wavelet frame
and the last 128 rows of H are given by \((1, \ldots , 1)\). It is well known that for the above translation invariant Haar frame transform, a scale dependent shrinkage has to be applied to each scale, namely starting with threshold \(\lambda \) the next scales should be thresholded by
For an explanation of this statement we refer to [43]. In summary, we obtain
where \(\tilde{S}_\lambda \) denotes the scale-wise adapted thresholding.
Now, we compare this scale-dependent Haar frame soft thresholding method with a learned PPNN with 1024 neurons in the hidden layers and componentwise soft-shrinkage \(S_\lambda \) as activation function. The optimal threshold in \(S_\lambda \) is again determined either by threefold cross validation, where the tested parameters are chosen in [0.01, 0.1] with steps of 0.01, or by setting \(\lambda \) as a trainable variable of the neural network and optimizing it via stochastic gradient descent. We emphasize that in contrast to the Haar frame shrinkage procedure with (31), the same threshold for each component is used in the activation function of our PPNN. The resulting PSNRs are given in Table 2. As expected, using the same threshold in the classical Haar frame shrinkage is worse than the scale-adapted Haar frame shrinkage. PPNNs with learned Stiefel matrices perform better for an increasing number of layers.
Finally, Fig. 3 contains the denoised signals from Fig. 2. The results are visually better than in the previous figure, although still not satisfactory due to the loss function used.
Classification In this example, we train a PPNN for classifying the MNIST data set. The length of the input signals is \(d=28^2\). We consider a PPNN with \(K-1 =5\) layers with \(n_1=n_2=784\), \(n_3=n_4=400\) and \(n_5=200\) neurons in the layers and the componentwise applied ReLU activation function \(\sigma (x) = \max (0,x)\). To get 10 output elements (probabilities) in (0, 1), we use an additional sixth layer
with another activation function \(g(x) := \frac{1}{1 + \mathrm {exp} (-x)}\). For training, we use a batch size of 1024 and a learning rate of 5. After 1000 epochs we reach an accuracy of 0.9855 on the test set. One epoch takes about one second on a NVIDIA Quadro M5000 GPU. In Fig. 4, the training and test loss of our PPNN during training are plotted.
Remark 7.1
As already mentioned in Remark 6.4, NNs with Stiefel matrices were also applied in [26]. The authors of [26] reported that the training process using Riemannian optimization on the Stiefel manifold could be unstable or divergent. We do not observe such instabilities in our setting.
Adversarial Attacks Neural networks with bounded Lipschitz constants were successfully applied to defend against adversarial attacks, see [22, 46]. In this example, we demonstrate that a PPNN is more robust under adversarial attacks than a standard neural network. Assume that we have a neural network \(f=(f_1,\ldots ,f_{10}):\mathbb {R}^{28^2}\rightarrow (0,1)^{10}\) for classifying MNIST, e.g., a PPNN as described in the previous example. Further, we are given an input \(x\in \mathbb {R}^{28^2}\). Now, we perform an adversarial attack in the following way:
-
Set \(\nu :={\mathop {\hbox {argmax}}\nolimits _{i\in \{1,\ldots ,10\}}} f_i(x)\), and \(g := \nabla _x \tfrac{f_\nu (x)}{\Vert f(x)\Vert _1}\).
-
Initialize \(\epsilon = 10^{-2}\) and while \(\nu ={\mathop {\hbox {argmax}}\nolimits _{i\in \{1,\ldots ,10\}}} f_i(x-\epsilon g)\) update \(\epsilon =2\epsilon \).
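The attack loop above can be sketched as follows (a minimal illustration; the toy classifier and the helper `grad_f`, which is assumed to return the gradient \(\nabla _x f_\nu (x)/\Vert f(x)\Vert _1\), are ours and only exercise the doubling scheme):

```python
import numpy as np

def adversarial_norm(f, grad_f, x):
    """Norm ||eps * g||_2 of the perturbation found by the doubling scheme:
    start with eps = 1e-2 and double it until the predicted class flips."""
    nu = int(np.argmax(f(x)))
    g = grad_f(x, nu)                       # gradient of f_nu(x) / ||f(x)||_1
    eps = 1e-2
    while int(np.argmax(f(x - eps * g))) == nu:
        eps *= 2.0
        if eps > 1e6:                       # safeguard against non-termination
            break
    return np.linalg.norm(eps * g)

# Toy linear two-class "network" to exercise the loop.
A = np.eye(2)
f = lambda x: A @ x
grad_f = lambda x, nu: A[nu] / max(np.abs(f(x)).sum(), 1e-12)
norm = adversarial_norm(f, grad_f, np.array([1.0, 0.2]))
assert norm > 0
```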
This procedure is applied to two neural networks. More precisely, the first one is the PPNN from the previous example and the second one is a neural network with the same structure as the PPNN, but without the orthogonality constraint. We train the standard neural network using the Adam optimizer with a learning rate of \(10^{-4}\) and end up with an accuracy of 0.9863. Then, we perform an adversarial attack on both networks and record the norm \(\Vert \epsilon g\Vert _2\) of the noise that changes the prediction. We do this for each input \(x_k\in [0,255]^{28^2}\), \(k=1,\ldots ,10{,}000\), in the test set and compute the mean, standard deviation and median of these norms. For the PPNN, we record an average norm of \(38.28\pm 24.51\) and a median of 33.71. For the standard neural network, we record an average norm of \(30.48\pm 15.72\) and a median of 28.75. Overall, the PPNN appears to be more stable against such adversarial attacks.
8 Conclusions
In this paper, we have shown that for real Hilbert spaces \(\mathcal {H}\) and \(\mathcal {K}\), a proximity operator \(\mathrm {Prox}:\mathcal {K}\rightarrow \mathcal {K}\) and a linear bounded operator \(T:\mathcal {H}\rightarrow \mathcal {K}\) the operator \(T^\dagger \, \mathrm {Prox}(T \cdot + b)\) with \(b \in \mathcal {K}\) is a proximity operator on \(\mathcal {H}\). As a consequence, the famous frame soft shrinkage operator can be seen as a proximity operator. Using these new relations, we have discussed special neural networks arising from Parseval frames and stable activation functions. Our networks are Lipschitz networks, which are moreover averaged operators. Our framework covers recently proposed networks containing matrices whose transposes lie in a Stiefel manifold and interprets them from another, more general point of view.
In our future work, we want to explore for which learning tasks the higher flexibility of our PPNNs is advantageous. Taking more general operators T into account may also be useful. In particular, we will apply our PNNs within Plug-and-Play algorithms. Another question we want to address is how to constrain our Stiefel matrices further, e.g., toward convolutional structures and sparsity constraints. Depending on the application, we have to design appropriate loss functions as well as to incorporate regularizing terms.
For our experiments, the stochastic gradient algorithm on Stiefel manifolds worked well. However, other minimization methods could be taken into account. In [26], for example, the authors proposed an orthogonal weight normalization algorithm that was inspired by the fact that the eigenvalue decomposition is differentiable. Finally, we would like to mention that a proximal backpropagation algorithm, which takes implicit instead of explicit gradient steps to update the network parameters during training, was proposed in [19].
A better understanding of the convergence of the cyclic proximal point algorithm, see Remark 5.6, and suitable early stopping criteria if the network \(\Phi \) is iteratively used may help to design NNs and to understand their success.
Change history
28 April 2021
A Correction to this paper has been published: https://doi.org/10.1007/s00041-021-09848-9
References
Absil, P.-A., Mahony, R., Sepulchre, R.: Optimization Algorithms on Matrix Manifolds. Princeton University Press, Princeton and Oxford (2008)
Anil, C., Lucas, J., Grosse, R.: Sorting out Lipschitz function approximation. In: Chaudhuri, K., Salakhutdinov, R., editors, Proceedings of the 36th International Conference on Machine Learning, vol. 97 of Proceedings of Machine Learning Research, pp. 291–301, Long Beach, California, USA. PMLR (2019)
Arjovsky, M., Shah, A., Bengio, Y.: Unitary evolution recurrent neural networks. In: International Conference on Machine Learning, pp. 1120–1128 (2016)
Bansal, N., Chen, X., Wang, Z.: Can we gain more from orthogonality regularizations in training deep networks? In: Advances in Neural Information Processing Systems, pp. 4261–4271 (2018)
Bauschke, H.H., Combettes, P.L.: Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer, New York (2011)
Beck, A.: First-Order Methods in Optimization, vol. 25 of MOS-SIAM Series on Optimization. SIAM (2017)
Bertsekas, D.P.: Incremental proximal methods for large scale convex optimization. Math. Program. 129, 163–195 (2011)
Burger, M., Sawatzky, A., Steidl, G.: First Order Algorithms in Variational Image Processing. In Operator Splittings and Alternating Direction Methods. Springer, New York (2017)
Chan, S.H., Wang, X., Elgendy, O.A.: Plug-and-play ADMM for image restoration: fixed-point convergence and applications. IEEE Trans. Comput. Imaging 3, 84–98 (2016)
Chouzenoux, E., Pesquet, J.-C., Repetti, A.: Variable metric forward-backward algorithm for minimizing the sum of a differentiable function and a convex function. J. Optim. Theory Appl. 162, 107–132 (2014)
Christensen, O.: An Introduction to Frames and Riesz Bases. Springer, New York (2016)
Combettes, P.L.: Monotone operator theory in convex optimization. Math. Program. 170(1), 177–206 (2018)
Combettes, P.L., Pesquet, J.-C.: Deep neural network structures solving variational inequalities. Set-Valued and Variational Analysis, pp. 1–28 (2020)
Combettes, P.L., Wajs, V.R.: Signal recovery by proximal forward-backward splitting. Multisc. Model. Simul. 4, 1168–1200 (2005)
Cvetković, Z., Vetterli, M.: Oversampled filter banks. IEEE Trans. Signal Process. 46, 1245–1255 (1998)
Daubechies, I., Defrise, M., De Mol, C.: An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Commun. Pure Appl. Math. 57(11), 1413–1457 (2004)
Dorobantu, V., Stromhaug, P.A., Renteria, J.: DIZZYRNN: reparameterizing recurrent neural networks for norm-preserving backpropagation. In: CoRR arXiv:1612.04035 (2016)
Elad, M., Aharon, M.: Image denoising via sparse and redundant representations over learned dictionaries. IEEE Trans. Image Process. 15(12), 3736–3745 (2006)
Frerix, T., Möllenhoff, T., Moeller, M., Cremers, D.: Proximal backpropagation. Technical report. ArXiv Preprint arXiv:1706.04638 (2018)
Geppert, J.A., Plonka, G.: Frame soft shrinkage operators are proximity operators. Technical report, arXiv preprint arXiv:1910.01820 (2019)
Golub, G.H., Loan, C.F.V.: Matrix Computations. The Johns Hopkins University Press, Baltimore (2013)
Goodfellow, I., Shlens, J., Szegedy, C.: Explaining and harnessing adversarial examples. In: International Conference on Learning Representations (2015)
Gouk, H., Frank, E., Pfahringer, B., Cree, M.: Regularisation of neural networks by enforcing Lipschitz continuity. arXiv:1804.04368 (2018)
Gribonval, R., Nikolova, M.: A characterization of proximity operators. arXiv:1807.04014 (2020)
Harandi, M., Fernando, B.: Generalized backpropagation, etude de cas: orthogonality. In CoRR abs/1611.05927 (2016)
Huang, L., Liu, X., Lang, B., Yu, A.W., Wang, Y., Li, B.: Orthogonal weight normalization: solution to optimization over multiple dependent Stiefel manifolds in deep neural networks. In: Thirty-Second AAAI Conference on Artificial Intelligence (2018)
Huster, T.P., Chiang, C.-Y.J., Chadha, R.: Limitations of the Lipschitz constant as a defense against adversarial examples. In: ECML PKDD 2018 Workshops, pp. 16–29. Springer, New York (2019)
Jing, L., Shen, Y., Dubcek, T., Peurifoy, J., Skirlo, S., LeCun, Y., Tegmark, M., Soljačić, M.: Tunable efficient unitary neural networks (EUNN) and their application to RNNs. In: Proceedings of the 34th International Conference on Machine Learning-Vol. 70, pp. 1733–1741. JMLR. org (2017)
Kobler, E., Klatzer, T., Hammernik, K., Pock, T.: Variational networks: connecting variational methods and deep learning. In: German conference on pattern recognition, pp. 281–293. Springer, New York (2017)
Lerman, G., Maunu, T.: An overview of robust subspace recovery. Proc. IEEE 106(8), 1380–1410 (2018)
Lezcano-Casado, M., Martínez-Rubio, D.: Cheap orthogonal constraints in neural networks: A simple parametrization of the orthogonal and unitary group. arXiv:1901.08428 3794–3803, (2019)
Mallat, S.: A Wavelet Tour of Signal Processing: The Sparse Way. Elsevier, Amsterdam (2008)
Miyato, T., Kataoka, T., Koyama, M., Yoshida, Y.: Spectral normalization for generative adversarial networks. In: International Conference on Learning Representations (2018)
Moreau, J.-J.: Proximité et dualité dans un espace Hilbertien. Bull. Soc. Math. France 93, 273–299 (1965)
Neumayer, S., Nimmer, M., Setzer, S., Steidl, G.: On the rotational invariant \(l_1\)-norm PCA. Linear Algebra Appl. 587, 243–270 (2019)
Nishimori, Y., Akaho, S.: Learning algorithms utilizing quasi-geodesic flows on the Stiefel manifold. Neurocomputing 67, 106–135 (2005)
Plonka, G., Steidl, G.: A multiscale wavelet-inspired scheme for nonlinear diffusion. Int. J. Wavel. Multiresol. Inf. Process. 4(1), 1–21 (2006)
Reich, S.: Weak convergence theorems for nonexpansive mappings in Banach spaces. J. Math. Anal. Appl. 67, 274–276 (1979)
Sedghi, H., Gupta, V., Long, P.M.: The singular values of convolutional layers. In: International Conference on Learning Representations (2019)
Setzer, S.: Operator splittings, Bregman methods and frame shrinkage in image processing. Int. J. Comput. Vis. 92(3), 265–280 (2011)
Sommerhoff, H., Kolb, A., Moeller, M.: Energy dissipation with plug-and-play priors. In: NeurIPS 2019 Workshop (2019)
Sreehari, S., Venkatakrishnan, S.V., Wohlberg, B.: Plug-and-play priors for bright field electron tomography and sparse interpolation. IEEE Trans. Comput. Imaging 2, 408–423 (2016)
Steidl, G., Weickert, J., Brox, T., Mrázek, P., Welk, M.: On the equivalence of soft wavelet shrinkage, total variation diffusion, total variation regularization, and sides. SIAM J. Numer. Anal. 42(2), 686–713 (2004)
Sun, Y., Wohlberg, B., Kamilov, U.: An online plug-and-play algorithm for regularized image reconstruction. IEEE Trans. Comput. Imaging 5, 395–408 (2018)
Teodoro, A.M., Bioucas-Dias, J.M., Figueiredo, M.A.: A convergent image fusion algorithm using scene-adapted Gaussian-mixture-based denoising. IEEE Trans. Image Process. 28(1), 451–463 (2018)
Tsuzuku, Y., Sato, I., Sugiyama, M.: Lipschitz-margin training: scalable certification of perturbation invariance for deep neural networks. Adv. Neural Inf. Process. Syst. 31, 6541–6550 (2018)
Vorontsov, E., Trabelsi, C., Kadoury, S., Pal, C.: On orthogonality and learning recurrent networks with long term dependencies. In: Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 3570–3578. JMLR.org (2017)
Wen, Z., Yin, W.: A feasible method for optimization with orthogonality constraints. Math. Program. 142(1–2), 397–434 (2013)
Wisdom, S., Powers, T., Hershey, J., Le Roux, J., Atlas, L.: Full-capacity unitary recurrent neural networks. In: Advances in Neural Information Processing Systems, pp. 4880–4888 (2016)
Acknowledgements
Many thanks to the reviewer for the constructive suggestions in relation with Lipschitz networks and Plug-and-Play algorithms. Further, we like to thank J.-C. Pesquet for pointing us to [12], which we were not aware of when writing this paper. Funding by the German Research Foundation (DFG) within the Project STE 571/14-1, the RTG 1932 and the RTG 2088 is gratefully acknowledged. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Quadro M5000 GPU used for this research.
Funding
Open Access funding enabled and organized by Projekt DEAL.
Communicated by Holger Rauhut.
Appendices
Appendix: Gradient Computation
Proof
Proof of Lemma 6.5: For potential further applications, we compute the gradient of the more general functional
where \(T^\dagger = (T^* T)^{-1} T^*\). Note that the case \(T^\dagger = T^* (T T^* )^{-1}\) can be treated in an analogous way. Using \(f_1(T) = (T^* T) ^{-1}\), \(f_2(T) = T^*\), \(f_3(T,b) = \sigma (Tx+b)\), the function \(f(T,b) := T^\dagger \sigma (Tx + b)\) is decomposed as
By the product rule it holds
The involved differentials are given by
Consequently, we obtain
Using the chain rule, we conclude
Thus,
For the gradient with respect to b, we obtain by the chain rule
\(\square \)
Activation Functions
The following table from [13] lists functions f whose proximity operator \(\sigma \) is a common activation function in NNs.
Hasannasab, M., Hertrich, J., Neumayer, S. et al. Parseval Proximal Neural Networks. J Fourier Anal Appl 26, 59 (2020). https://doi.org/10.1007/s00041-020-09761-7
Keywords
- Lipschitz neural networks
- Averaged operators
- Proximal operators
- Frame shrinkage
- Adversarial robustness
- Optimization on Stiefel manifolds