1 Introduction

Driven by ever-rising compute power and successful applications in several domains, artificial intelligence (AI) systems and especially machine learning (ML) methods attract growing attention for advanced tasks in mechanical engineering [1,2,3]. This is supported by well-established and flexible machine learning frameworks like PyTorch [4] and TensorFlow [5].

One particular application of ML is the solution of parameterized partial differential equations (PPDE), which are traditionally solved by numerical discretization methods like finite element method (FEM), finite difference method (FDM), finite volume method (FVM) or boundary element method (BEM). Based on ML techniques, two new classes of methods arose, namely the neural FEM and neural operator methods [6]. The aim of the current work is to compare these two classes of methods for applications in solid body mechanics in terms of computational effort and accuracy. This is conducted by means of different case studies where the conventional FEM serves as a benchmark. The differences between both classes of methods are schematically summarized in Fig. 1. The neural FEM is applied to a single mechanical task, whereas neural operator methods require prior training on sample solutions of a parameterized mechanical task before solving it for new parameter combinations. The reference solution is obtained by a conventional FEM solver code.

Fig. 1

Schematic representation of the paper objectives

The mathematical problem to solve with either method can be described as follows: Let an arbitrary parameterized partial differential equation be given on an open domain B with piecewise smooth boundary \(\Gamma\) in the form:

$$\begin{aligned} \mathcal {N}[\varvec{u}(\varvec{y}) ; \varvec{y}]=\textbf{0} \quad {\text {on}} \,\, B, \quad \mathcal {B}[\varvec{u}(\varvec{y}) ; \varvec{y}]=\textbf{0} \quad {\text {on}} \,\, \Gamma \,\, , \end{aligned}$$
(1)

where \(\mathcal {N}\) is a nonlinear operator on the domain B, \(\mathcal {B}\) an operator on \(\Gamma\) that determines the boundary conditions, and \(\varvec{u}(\varvec{y})\in \mathbb {R}^{d}\) the solutions of the PDE. All quantities are parameterized by \(\varvec{y} \in \mathbb {R}^n\). The mapping

$$\begin{aligned} G: \quad B \cup \Gamma \,\, \times \,\, \mathbb {R}^n \rightarrow \mathbb {R}^{d}, \quad \left( \varvec{X}, \varvec{y}\right) \mapsto \varvec{u}, \quad \varvec{X} \in B \cup \Gamma , \quad n, d \in \mathbb {N} \end{aligned}$$
(2)

is called the solution operator of the PPDE.

Neural FEM resembles a conventional FEM implementation. The artificial neural network (NN) approximates the solution function of a particular realization of the PPDE. All approaches from this class, e.g., physics-informed neural networks (PINN, [7]), the deep energy method (DEM) and competitive PINNs (cPINN) [8, 9], are independent of a spatial discretization of the domain B (grid-independent) and can achieve high accuracy, but must be retrained for each new set of parameters.

In contrast, neural operator methods train an NN to behave like the solution operator of a PPDE. Then, the network can be applied to arbitrary combinations of parameters and boundary conditions of boundary value problems (BVP). These methods are particularly characterized by a discretization-independent error, allowing zero-shot super resolution (training on coarse grid, inference on fine grid) but typically require a large amount of training data, which may need to be computed in a numerically expensive way [10]. In this work, the deep operator network (DeepONet) [11] and the Fourier Neural Operator (FNO) [6] are studied as representatives of neural operator methods.

As a hybrid solution, physics-informed variants of neural operator methods address this drawback by incorporating knowledge of the underlying PDE as a regularizing mechanism in the loss function [7]. This can increase accuracy, generalizability and data efficiency [12]. The present contribution investigates the physics-informed DeepONet (PIDeepONet) [12] and the physics-informed neural operator (PINO) [13] as representatives of physics-informed neural operator methods.

The large variety of families of methods with partially interchangeable components calls for comparisons and investigations of their performance in different application fields. The work at hand aims to address this gap for the field of solid body mechanics.

The paper is structured as follows. First, we present a literature review on specific properties of NNs and the motivation for the development of PINNs and neural operator methods (Sect. 2). Thereafter, we describe the methodology necessary for the comparative analysis. This comprises an insight into details of the selected ML methods (Sects. 3.2 and 3.3), as well as the comparison methodology including the case studies (Sect. 4). Afterward, the results are presented in comparison with the reference solution from FEM and the specific behavior of each investigated NN method is discussed (Sect. 5). Finally, the results of the performance comparison are summed up and necessary steps for a future use of these NN methods in elastostatics are outlined (Sect. 6).

2 Literature review

The field of machine learning, particularly neural networks, has gained significant traction in recent years due to its applicability in various domains, such as pattern and image recognition [14], classification [15], prediction and natural language processing [16]. This is driven by high-speed processing provided by massively parallel implementations, the adaptivity of NN architectures, transfer learning and remarkable performance for learning under uncertainty [17]. The effectiveness of NNs stems from their outstanding prediction ability, which is underpinned by the universal approximation theorem [18, 19]. Inspired by the achievements of data-driven NN applications, many approaches for the field of applied mechanics have been investigated with various applications in structural engineering [20] and optimization [21], including hybrid approaches between conventional FEM and NN [22].

The universal function approximation property of NNs closely resembles the interpolation used in conventional numerical methods for partial differential equations (PDE) [23] such as the finite element method (FEM) [24]. Based on this, the class of physics-informed neural networks (PINN) emerges when a loss function is chosen that incorporates the residual of the PDE [7, 25]. This approach makes it possible to set up a general and scalable PDE solver system without the need to know boundary conditions and constitutive laws a priori, while retaining the option to incorporate prior knowledge from observations extracted from structural and mechanical experiments [22, 26]. Several adaptations of PINNs have been proposed [8, 9, 27,28,29], since the original formulation treats the different boundary conditions separately by embedding them into the loss function in the form of soft constraints, which significantly increases the complexity of the loss landscape and may cause severe convergence issues [30, 31].

In contrast, the neural operator methods exploit the promise of AI-driven approaches to numerical simulation to train fast surrogate models for approximating solutions of PPDEs, which can be evaluated on the order of seconds rather than hours [11]. This means front-loading the computational burden to the training time (offline, including the simulation of training data) so that at inference (online) time, trained models can be evaluated several orders of magnitude faster than with a conventional solver. This approach imposes a trade-off between the comparatively very high training effort and the expected number of evaluations [32]. Usually, this is justified for optimization tasks [33, 34] and has also been suggested for multiscale modeling [35].

Additionally, deep learning-based approaches offer the possibility to compute gradients and sensitivities of PDEs by automatic differentiation (autograd feature in PyTorch and Tensorflow), thus making it possible to solve inverse problems without requiring users to manually differentiate the (forward) solver and implement the corresponding gradients of the simulator with respect to the input [32, 36, 37].

Improvements of the methods discussed here are continuously ongoing, e.g., directly for the original deep energy method [38] or in the form of adaptations like the mixed DEM (mDEM) [29] and the shallow energy method (SEM) [39]. Performance comparisons of the NN methods presented here have been conducted, for example, in weather and climate modeling [40], for PINNs in fluid mechanics [41], in the form of a comparison between DEM and graph neural networks (GNN) for solid mechanics [42] and between the two neural operator methods DeepONet and FNO for general PDEs [43]. However, as of now, a comprehensive investigation that includes several representatives from each class of approaches, along with an in-depth discussion of the associated effects, remains a gap in the literature, especially with a focus on solid body mechanics. The aim of this work is to address this issue.

3 Computational methodology

3.1 Neural network nomenclature

Artificial neural networks are constructed as layers of neurons [1, 28] as illustrated in Fig. 2. Each neuron applies a (typically nonlinear) activation function. In a fully connected neural network (FCNN), each neuron receives as input a linear transformation of all outputs of the neurons in the previous layer. The output \(\varvec{\mathcal {R}}^i\) of the \(i^{th}\) layer is thus calculated by:

$$\begin{aligned} \varvec{W}^i&= \varvec{w}^{i} \varvec{\mathcal {R}}^{i-1} + \varvec{b}^i \end{aligned}$$
(3)
$$\begin{aligned} \varvec{\mathcal {R}}^i&= \varvec{a}^i(\varvec{\Theta }^i, \varvec{W}^i) \end{aligned}$$
(4)

The weights \(\varvec{w}^{i}\) and biases \(\varvec{b}^i\) of the \(i^{th}\) layer, together with the parameters \(\varvec{\Theta }^i\) of the activation functions \(\varvec{a}\), form the set of parameters \(\theta\) of the neural network. Other parameters of the network architecture (like number of layers and layer widths) and the optimizer algorithm (e.g., step width) are called hyperparameters and have to be chosen by the NN user or an outer optimization strategy. All the layers between the input and output layer are called hidden layers. The number of (hidden) layers is referred to as the network’s depth, whereas the number of neurons within a layer is called the width of the layer.
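As an illustration of Eqs. (3) and (4), a minimal PyTorch sketch of an FCNN is given below; the layer widths, the Tanh activation and the absence of trainable activation parameters \(\varvec{\Theta }^i\) are illustrative choices, not the settings of the later studies.

```python
import torch
import torch.nn as nn

class FCNN(nn.Module):
    """Fully connected NN; widths = [input width, hidden widths..., output width]."""
    def __init__(self, widths, activation=nn.Tanh):
        super().__init__()
        layers = []
        for w_in, w_out in zip(widths[:-1], widths[1:]):
            layers.append(nn.Linear(w_in, w_out))  # weights w^i and biases b^i, Eq. (3)
            layers.append(activation())            # activation a^i, Eq. (4)
        layers.pop()                                # no activation on the output layer
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

model = FCNN([1, 10, 1])                            # e.g., the [1, 10, 1] architecture used below
u = model(torch.linspace(-1.0, 1.0, 100).unsqueeze(-1))   # output shape: (100, 1)
```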

The whole network represents an arbitrary (continuous) mapping (universal approximation theorem [18]) \(\mathcal {R}: \mathbb {R}^n \mapsto \mathbb {R}^m\) from the input to the output side, with n and m the input and output layer width, respectively. To make the network approximate a mapping \(\tilde{\mathcal {R}}\) on a subset \(D \subset \mathbb {R}^n\), the mapping is given indirectly by a set \(T_{tr}\) of training tuples \(t_{tr}^k = \left( P^k, \tilde{\mathcal {R}}(P^k) \right) , \,\, P^k \in D, \,\, k \in \mathbb {N}\), which together form the training data set. \(P^k\) are the input samples and \(\tilde{\mathcal {R}}(P^k)\) the corresponding target outputs. A set \(T_{te}\) of testing tuples \(t_{te}^l, \,\, l \in \mathbb {N}\) is required to check the quality of the approximation the network has learned so far. Usually, \(T_{tr} \cap T_{te} = \emptyset\). In the application of a network, arbitrary data within D can serve as input, but the exact output is usually unknown and only approximated by the net.

Conventionally, the parameters of the network are adapted by an optimization algorithm like Adam [44]. This algorithm minimizes the empirical risk F (also called loss value) that is calculated as the output of a loss function \(\mathcal {L}\). The latter is commonly defined as the discrepancy between the target outputs and the outputs of the network with the current parameters. A frequent choice for the loss function is the mean squared error (MSE) over \(N \in \mathbb {N}\) tuples \(t^k\)

$$\begin{aligned} {\rm{MSE}} = \frac{1}{N} \sum _{k=1}^{N} \left( \tilde{\mathcal {R}}(P^k) - \mathcal {R}(P^k) \right) ^2. \end{aligned}$$
(5)

The optimization employs (partial) derivatives of the loss function w.r.t. the parameters of the network. To this end, machine learning frameworks like PyTorch record all operations acting on a variable from input to output, so that fast and highly accurate differentiation becomes possible. This feature is referred to as automatic differentiation (autograd) [36]. Its use, however, is not limited to derivatives w.r.t. the network parameters.
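A hedged sketch of this feature for derivatives w.r.t. the network inputs (as needed for PDE residuals later on); the small network is only a placeholder:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(1, 10), nn.Tanh(), nn.Linear(10, 1))   # placeholder network

X = torch.linspace(-1.0, 1.0, 100, requires_grad=True).unsqueeze(-1)
u = model(X)
# First and second derivative of the network output w.r.t. its input via autograd;
# create_graph=True keeps the graph so that higher derivatives (and backprop) remain possible.
du_dX, = torch.autograd.grad(u, X, grad_outputs=torch.ones_like(u), create_graph=True)
d2u_dX2, = torch.autograd.grad(du_dX, X, grad_outputs=torch.ones_like(du_dX), create_graph=True)
```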

Fig. 2

Information flow in a fully connected neural network (FCNN). (Adapted from [8])

By default, the NN parameters are initialized randomly before the first optimizer step. This leads to statistical variations in training effort and achieved accuracy.

3.2 Neural FEM

In this class of methods, the output of the NN is chosen as the unknown function of the PDE. The loss function is then computed either from the residual of the PDE (classical physics-informed neural networks (PINN) [7] and competitive PINN (cPINN) [9]), or from the potential energy if the minimum principle applies (deep energy method (DEM) [8], mixed DEM (mDEM) [29]), both incorporating the outputs of the NN. Hence, no training data set is needed.

3.2.1 Physics-informed neural networks (PINN)

The original PINN formulation uses an FCNN where the loss function is applied to the squared residual of the PDE on specified collocation points [7],

$$\begin{aligned} F^{{\rm{PINN}}, B} = \frac{1}{N_{f}} \sum _{i=1}^{N_{f}}\left( \mathcal {N}\left[ \varvec{u}_{\theta }\right] \left( \varvec{X}_{i}^{f}\right) \right) ^{2}. \end{aligned}$$
(6)

The boundary conditions are accounted for by an additional term in the form

$$\begin{aligned} F^{{\rm{PINN}}, \Gamma } = \frac{\lambda _{b}}{N_{b}} \sum _{i=1}^{N_{b}}\left( \mathcal {B}\left[ \varvec{u}_{\theta }\right] \left( \varvec{X}_{i}^{b}\right) \right) ^{2}, \end{aligned}$$
(7)

where \(\lambda _{b}\) is a hyperparameter that weighs the error proportions, since the propagated gradients can be of different magnitudes, thus driving the optimization procedure toward an incorrect solution [30]. The total empirical risk is then calculated by

$$\begin{aligned} F^{{\rm{PINN}}}\left( \varvec{u}_{\theta }\right) = F^{{\rm{PINN}}, B} + F^{{\rm{PINN}}, \Gamma }. \end{aligned}$$
(8)
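The following sketch illustrates how Eqs. (6)–(8) could be assembled in PyTorch for a generic 1D problem; the residual, the boundary conditions, \(\lambda _b\) and the optimizer settings are placeholders and do not reproduce the studies below.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(1, 10), nn.Tanh(), nn.Linear(10, 1))
X_f = torch.linspace(-1.0, 1.0, 100).unsqueeze(-1).requires_grad_(True)  # collocation points in B
X_b = torch.tensor([[-1.0], [1.0]], requires_grad=True)                  # boundary points on Gamma
lam_b = 1.0                                                               # weighting hyperparameter

def d(y, x):
    """First derivative of y w.r.t. x via autograd."""
    return torch.autograd.grad(y, x, torch.ones_like(y), create_graph=True)[0]

def empirical_risk():
    u = model(X_f)
    res_domain = -d(d(u, X_f), X_f) - X_f          # placeholder residual N[u] = -u'' - f, f(X) = X
    u_b = model(X_b)
    res_bc = torch.cat([u_b[0], d(u_b, X_b)[1]])   # placeholder BCs: u(-1) = 0, u'(1) = 0
    return (res_domain**2).mean() + lam_b * (res_bc**2).mean()            # Eqs. (6)-(8)

opt = torch.optim.LBFGS(model.parameters(), lr=1.0)
for _ in range(15):
    def closure():
        opt.zero_grad()
        loss = empirical_risk()
        loss.backward()
        return loss
    opt.step(closure)
```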

According to [31], the training of PINNs can fail even on very simple problems, such as 1D convection or the reaction–diffusion equation. An analysis of the occurrence of comparable pathologies in elastostatic or elastodynamic contexts requires further research. In the survey at hand, they did not manifest.

Deep collocation method (DCM) The deep collocation method (DCM) [28] is a representative of the classical PINN, where the empirical risk is built from the squared residual at random collocation points. The constraints are typically accounted for by additional penalty terms rather than enforced by a transformation of the output data.

3.2.2 The deep energy method (DEM)

The DEM was originally introduced to calculate finite deformation hyperelasticity [8]. This method as well as methods derived from it require only first derivatives to compute the loss function, thus reducing the numerical complexity. In return, errors are generated by the numerical integration of the energy function.

The solution is sought in the form of the displacement field \(\varvec{u}(\varvec{X})\) that corresponds to the minimum total potential energy \(\Pi\). This minimization can be accomplished by choosing the loss function F to be the total potential energy, \(F:= \Pi\). The input of the NN consists of points \(\varvec{X} \in B \cup \Gamma\) in the reference configuration, either in the domain B or on its boundary \(\Gamma\).

A transformation is applied to integrate the geometric boundary conditions: Let the output of the NN be given by \(\varvec{z}(\theta , \varvec{X})\). To retrieve the displacement field \(\varvec{u}_\theta (\varvec{X})\) based on the parameter set \(\theta\) of the NN, the displacements on the boundary are introduced in a separate term \(\varvec{u}_g(\varvec{X})\). Additionally, a mapping \(\varvec{A}(\varvec{X})\) with \(\varvec{A}(\varvec{X})=\textbf{0} { \text{ for } } \varvec{X} \in \Gamma\) is introduced. Then, the output is constructed as:

$$\begin{aligned} \varvec{u}_{\theta }(\varvec{X})=\varvec{u}_{g}(\varvec{X})+\varvec{A}(\varvec{X}) \varvec{z}(\theta , \varvec{X}) \end{aligned}$$
(9)

Now, \(\varvec{u}_\theta\) automatically fulfills the boundary constraints and a nonlinear optimization problem without constraints is obtained. This problem can be solved with an optimization procedure such as the limited-memory Broyden–Fletcher–Goldfarb–Shanno (L-BFGS) algorithm. The corresponding loss function to minimize is

$$\begin{aligned} F^{{\rm{DEM}}}\left( \varvec{u}_{\theta }\right) =\int _{B} \left( W\left( \varvec{F}_{\theta }\right) -\varvec{b} \cdot \varvec{u}_{\theta } \right) \, {\rm{d}} V-\int _{\Gamma _{N}} \overline{\varvec{T}} \cdot \varvec{u}_{\theta } \, {\rm{d}} A \end{aligned}$$
(10)

with the elastic strain energy W, the deformation gradient \(\varvec{F}_\theta = \frac{\partial (\varvec{X} + \varvec{u}_\theta )}{\partial \varvec{X}}\), body forces \(\varvec{b}\) and the traction \(\overline{\varvec{T}} = \varvec{P}\cdot \varvec{N}\) defined as the \(1^{st}\) Piola–Kirchhoff stress tensor \(\varvec{P}\) projection on the outward normal of \(\Gamma\). The deformation gradient \(\varvec{F}_{\theta }\) contains only first derivatives and is retrieved by the autograd feature of the NN framework. The described procedure is shown in Fig. 3.

Fig. 3

Information flow in DEM. (Adapted from [8])

Several numerical methods to calculate an approximation of the integrals in the loss function have been suggested [8]. Some examples are the Monte Carlo integration and the trapezoidal rule that we use for the examples presented in Sect. 4.
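To illustrate the procedure, a condensed sketch of the DEM for the 1D bar of Example A (Sect. 4.1) is given below, with \(\varvec{u}_g = 0\) and \(A(X) = 1+X\) in Eq. (9) as used later in Sect. 5.1.3, the energy density of Eq. (20) and trapezoidal integration; the grid, network and optimizer settings are illustrative.

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(1, 10), nn.Tanh(), nn.Linear(10, 1))
X = torch.linspace(-1.0, 1.0, 1000).unsqueeze(-1).requires_grad_(True)   # equidistant grid points
f = X.detach()                                     # force density f(X) = X; traction T = 0

def potential_energy():
    u = (1.0 + X) * net(X)                         # Eq. (9): u_g = 0, A(X) = 1 + X enforces u(-1) = 0
    u_x, = torch.autograd.grad(u, X, torch.ones_like(u), create_graph=True)
    F = 1.0 + u_x                                  # 1D deformation gradient (assumed F > 0)
    W = F**1.5 - 1.5 * F + 0.5                     # energy density of Example A, Eq. (20)
    integrand = (W - f * u).squeeze()              # internal minus external energy density
    return torch.trapz(integrand, X.squeeze().detach())   # trapezoidal rule for Eq. (10)

opt = torch.optim.LBFGS(net.parameters(), lr=1.0)
for _ in range(15):
    def closure():
        opt.zero_grad()
        loss = potential_energy()
        loss.backward()
        return loss
    opt.step(closure)
```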

An alternative variant of the same concept is the shallow energy method (SEM), where the deep NN is replaced by a shallow NN with a single hidden layer, activated by radial basis functions (RBF) [39]. Another enhancement of the basic DEM is the mixed DEM (mDEM) [29], where both displacements \(\varvec{u}_{\theta }\) and stresses \(\varvec{P}_{\theta }\) are calculated by the NN. The deviation from the constitutive law derived from the strain energy W must then be integrated into the loss function to represent the correct material behavior

$$\begin{aligned} F^{{\rm{mDEM}}}=F^{{\rm{DEM}}}+\frac{V}{N_{f}} \sum _{j=1}^{N_{f}}\left| \left| \varvec{P}_{\theta }\left( \varvec{X}_{j}^{f}\right) -\left. \frac{\partial W\left( \varvec{F}_{\theta }\right) }{\partial \varvec{F}}\right| _{\varvec{X}_{j}^{f}}\right| \right| _{2}^{2}. \end{aligned}$$
(11)

The prescribed forces on the Neumann boundary part \(\Gamma _{N}\) can be accounted for directly by a transformation similar to Eq. (9) holding for the geometric boundary conditions on the Dirichlet boundary part \(\Gamma _{D}\). Alternatively, an additional error term can be included to penalize the squared deviation from the prescribed forces.

3.2.3 competitive PINN (cPINN)

cPINNs extend the idea of PINNs by formulating the learning problem as a zero-sum game in the style of a generative adversarial network (GAN) [9], with a Nash equilibrium that corresponds to the analytical solution of the PDE. This avoids the use of the square of the residual, which aims to improve the learning performance. Compared to a classical PINN, cPINN introduces an additional discriminator FCNN with NN parameter set \(\phi\) which is trained to predict the errors of the PINN. Let \(\varvec{u}_{\theta }\) furthermore be the output of the PINN and \(\varvec{d}_{\phi }=\left( \varvec{d}_{\phi }^{B}, \varvec{d}_{\phi }^{\Gamma }\right)\) the output of the discriminator network. Then the minimax formulation of the game is given by

$$\begin{aligned} \max _{\phi } \min _{\theta } F^{{\rm{cPINN}}}\left( \varvec{u}_{\theta }, \varvec{d}_{\phi }\right) =\left. \frac{1}{N_{f}} \sum _{i=1}^{N_{f}} \mathcal {N}\left[ \varvec{u}_{\theta }\right] \cdot \varvec{d}_{\phi }^{B}\right| _{\varvec{X}_{i}^{f}}+\left. \frac{\lambda _{b}}{N_{b}} \sum _{i=1}^{N_{b}} \mathcal {B}\left[ \varvec{u}_{\theta }\right] \cdot \varvec{d}_{\phi }^{\Gamma }\right| _{\varvec{X}_{i}^{b}}. \end{aligned}$$
(12)

Zeng et al. [9] solve this optimization problem by using the Adam based competitive gradient descent (ACGD) method.
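A conceptual sketch of the minimax objective of Eq. (12) is given below; it uses a naive alternating Adam update purely for illustration, whereas [9] employs the ACGD optimizer, and the residual and boundary terms are the same placeholders as in the PINN sketch above.

```python
import torch
import torch.nn as nn

pinn = nn.Sequential(nn.Linear(1, 10), nn.Tanh(), nn.Linear(10, 1))       # u_theta
disc = nn.Sequential(nn.Linear(1, 50), nn.ReLU(), nn.Linear(50, 2))       # d_phi = (d^B, d^Gamma)
X_f = torch.linspace(-1.0, 1.0, 100).unsqueeze(-1).requires_grad_(True)
X_b = torch.tensor([[-1.0], [1.0]], requires_grad=True)
lam_b = 1.0

def d(y, x):
    return torch.autograd.grad(y, x, torch.ones_like(y), create_graph=True)[0]

def game_value():
    u = pinn(X_f)
    res_B = -d(d(u, X_f), X_f) - X_f               # placeholder residual N[u_theta]
    u_b = pinn(X_b)
    res_G = torch.cat([u_b[0], d(u_b, X_b)[1]])    # placeholder boundary residuals B[u_theta]
    d_B = disc(X_f)[:, :1]                         # discriminator output on the domain
    d_G = disc(X_b)[:, 1]                          # discriminator output on the boundary
    return (res_B * d_B).mean() + lam_b * (res_G * d_G).mean()   # Eq. (12)

opt_min = torch.optim.Adam(pinn.parameters(), lr=1e-3)
opt_max = torch.optim.Adam(disc.parameters(), lr=1e-3)
for _ in range(1000):
    F = game_value()
    opt_min.zero_grad(); opt_max.zero_grad()
    F.backward()
    opt_min.step()                                 # gradient descent in theta (min player)
    for p in disc.parameters():
        p.grad = -p.grad                           # flip sign: gradient ascent in phi (max player)
    opt_max.step()
```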

3.3 Neural operator methods

In a multi-query context, where a PDE must be evaluated for a large number of parameters, classical methods are computationally intensive. This includes both the conventional FEM and the neural FEM methods explained previously. That drawback has motivated a large body of work on model order reduction, which, however, involves a trade-off toward reduced accuracy, stability and generalizability. Learning solution operators between (infinite-dimensional) function spaces using NNs [45] is a comparatively young field.

In this class of methods, the NN approximates the solution operator of the PPDE, i.e., for a given parameter set, the NN shall output the solution of the PDE at the points of interest. Now, the objective is defined as a risk functional that takes the probability distribution \(\chi\) of the parameter set \(\varvec{y}\) into account [6]

$$\begin{aligned} F=\int _{\chi } \mathcal {L}\left( G_{\theta }(\varvec{y}), G(\varvec{y})\right) {\rm{d}} \chi , \end{aligned}$$
(13)

where G is the solution operator and \(G_{\theta }\) its approximation by the NN.

A possibility to represent operators by means of NNs is a finite-dimensional approximation of the function spaces and interpolation of these spaces by the NN. Let, e.g., a boundary value problem (BVP) be given, with approximations to the solution calculated by traditional FEM at the node points. Now, the solution operator can be sought that maps the volume forces to the displacements at the node points, hence training a discrete operator [46, 47]. However, this approach introduces a grid dependency since the results are only obtained on the initially chosen node points.

Alternatively, the points of interest in space can be included as input parameter into the mapping that shall be represented by the NN,

$$\begin{aligned} \varvec{u}_{\theta }: \quad B \cup \Gamma \times \mathbb {R}^{\left( d+N_{{\rm{dof}}}\right) } \rightarrow \mathbb {R}^{N_{{\rm{dof}}}}, \quad [\varvec{X}, \textbf{y}] \mapsto \varvec{u}_{\theta }(\varvec{X}, \textbf{y}), \end{aligned}$$
(14)

where \(\textbf{y}\) is taken as the parameter of the displacement field \(\varvec{u}_{\theta }\).

In the following, deep operator network (DeepONet) and Fourier neural operator (FNO) are discussed. Both approaches are discretization independent and allow for small generalization errors.

3.3.1 Deep operator network (DeepONet) and physics-informed DeepONet (PIDeepONet)

NNs can be employed as universal approximators of continuous functions, as well as of nonlinear continuous operators [11, 12]

$$\begin{aligned} \quad |G(\varvec{y})(\varvec{X})-\underbrace{\varvec{g}\left( \varvec{y}\left( \varvec{X}_{1}\right) , \ldots , \varvec{y}\left( \varvec{X}_{m}\right) \right) }_{{\rm{branch}}} \cdot \underbrace{\varvec{f}(\varvec{X})}_{ {\rm{trunk}}}|<\varepsilon , \end{aligned}$$
(15)

where \(\varvec{f}\) (trunk) and \(\varvec{g}\) (branch) can be represented by various classes of neural networks that satisfy the requirements of the classical universal approximation theorem [18]. It is assumed that the parameter \(\varvec{y}\) is known on sufficiently many grid points m. On this basis, the stacked and unstacked deep operator network (DeepONet) are proposed in [11]. The stacked DeepONet differs from the unstacked one only in the definition of the branch networks. In the unstacked DeepONet, these are combined into one net to facilitate training (Fig. 4).

Let \(\varvec{y}=\left[ \varvec{y}\left( \varvec{X}_{1}\right) , \ldots , \varvec{y}\left( \varvec{X}_{m}\right) \right] ^{\top }\) be the parameter function at discrete points in space. Moreover, let the output of DeepONet be \(G_{\theta }(\textbf{y})(\varvec{X})\). Then, the empirical risk functional for DeepONet is given by

$$\begin{aligned} F^{{\text {DeepONet}}}=\frac{1}{P \, N} \sum _{i=1}^{N} \sum _{j=1}^{P}\left( G_{\theta }\left( \varvec{y}_{i}\right) \left( \varvec{X}_{j}\right) -G\left( \varvec{y}_{i}\right) \left( \varvec{X}_{j}\right) \right) ^{2}. \end{aligned}$$
(16)

Here, N is the number of realizations of the parameter input \(\varvec{y}\) available for training and P is the number of training data known per realization of the input function. The reference solution \(G\left( \varvec{y}_{i}\right)\) is determined by FEM simulations or measurements. A single data point of the training data set consists of a triple of the form \((\varvec{y}, \varvec{X}, \varvec{u}(\varvec{X}))\). If \(P>1\), the discrete parameter \(\varvec{y}\) must be repeated an appropriate number of times in the data set. For N realizations of the parameter and P evaluations per realization, the training data set has a total of \(N \times P\) entries.
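A minimal sketch of an unstacked DeepONet and the data-driven loss of Eq. (16) could look as follows; the subnet widths follow the [20, 100, 100] / [1, 100, 100] setup described in Sect. 5.2.1, while the random tensors only stand in for the actual \((\varvec{y}, \varvec{X}, \varvec{u}(\varvec{X}))\) training triples.

```python
import torch
import torch.nn as nn

class DeepONet(nn.Module):
    """Unstacked DeepONet: branch processes the parameter y at m sensor points,
    trunk processes the query coordinate X; outputs are combined by a scalar product."""
    def __init__(self, m=20, p=100):
        super().__init__()
        self.branch = nn.Sequential(nn.Linear(m, p), nn.ReLU(), nn.Linear(p, p))
        self.trunk = nn.Sequential(nn.Linear(1, p), nn.ReLU(), nn.Linear(p, p))

    def forward(self, y, X):
        # y: (batch, m) sensor values, X: (batch, 1) query points -> G_theta(y)(X): (batch, 1)
        return (self.branch(y) * self.trunk(X)).sum(dim=-1, keepdim=True)

model = DeepONet()
y = torch.randn(8000, 20)            # N x P rows: discrete parameter y, repeated P times each
X = torch.rand(8000, 1) * 2.0 - 1.0  # query points X_j in [-1, 1]
u_ref = torch.randn(8000, 1)         # placeholder for the FEM reference G(y)(X)
loss = ((model(y, X) - u_ref) ** 2).mean()   # empirical risk, Eq. (16)
```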

Despite its simple structure, DeepONet can represent a wide range of mappings (i.e., it is very expressive) and allows achieving small generalization errors. Furthermore, it can be applied very easily to arbitrary parametrizations.

The physics-informed variant PIDeepONet [12] extends the empirical risk functional by adding a physically motivated term ensuring the compliance with the PDE in the weak form. For this purpose, the risk functional is extended by adding the squared residuals of the (nonlinear) differential operators \(\mathcal {N}, \mathcal {B}\)

$$\begin{aligned} \begin{aligned} F^{{\text {PIDeepONet}}}=F^{{\text {DeepONet}}}&+\frac{1}{m N_{y}} \sum _{i=1}^{N_{y}} \sum _{j=1}^{m} \mathcal {N}^{2}\left[ G_{\theta }\left( \textbf{y}_{i}\right) \right] \left( \varvec{X}_{j}\right) \\&+\frac{1}{N_{b} N_{y}} \sum _{i=1}^{N_{y}} \sum _{j=1}^{N_{b}} \mathcal {B}^{2}\left[ G_{\theta }\left( \textbf{y}_{i}\right) \right] \left( \varvec{X}_{j}\right) . \end{aligned} \end{aligned}$$
(17)

In return, no FEM reference data is necessary.

Equation (17) is formulated for the case where the residual can only be evaluated at the m discretization points at which the parameters \(\varvec{y}\) are known. Alternatively, an extended data set can be generated which contains \(\left\{ \varvec{y}(X_{1}), \ldots , \varvec{y}(X_{m}), X_{1}, \varvec{y}(X_{1}), \ldots , X_{r}, \varvec{y}(X_{r})\right\}\) with r the number of (randomly determined) additional grid points.

Fig. 4

Information flow in PIDeepONet. Branch and trunk network are linked by the scalar product. (Adapted from [26])

3.3.2 Fourier neural operator (FNO)

An FNO [6] represents the solution operator of a PPDE with the help of a series of Fourier blocks. A Fourier block includes several operations as shown in Fig. 5: i) It applies the Fourier transform \(\mathcal {F}\) (in form of the fast Fourier transform, FFT) on its input \(\varvec{v}^t (\varvec{X})\). ii) It applies a linear transform \(R_\theta ^t\) (parameterized by the NN parameter set \(\theta\)) on the lower Fourier modes and filters out the higher modes. iii) It applies the inverse Fourier transform \(\mathcal {F}^{-1}\). iv) In parallel, the Fourier block applies another linear transform \(W_\theta ^t\) on the input \(\varvec{v}^t (\varvec{X})\). v) Results of both branches are summed up and forwarded to the nonlinear activation function \(\sigma\).

The input \([\varvec{X}, \varvec{y}(\varvec{X})] \in \mathbb {R}^{N_{dof} + d}\) of the network is lifted up to a higher dimension \(d_v\) by a shallow (e.g., single layer) FCNN P with linear activation function. Another FCNN Q projects the output of the last Fourier block onto the output space \(\mathbb {R}^{N_{dof}}\) which results in the FNO output \(\varvec{u}_\theta (\varvec{X})\). The whole process is visualized in Fig. 5 and can be described by the iterative architecture

$$\begin{aligned} \begin{aligned} \varvec{v}_{0}(\varvec{X})&=P(\varvec{y}(\varvec{X}), \varvec{X}) \\ \varvec{v}_{t+1}(\varvec{X})&=\sigma \left( W_{\theta }^{t} \varvec{v}_{t}(\varvec{X})+\mathcal {F}^{-1}\left[ R_{\theta }^{t} \cdot \mathcal {F}\left[ \varvec{v}_{t}\right] \right] (\varvec{X})\right) \\ \varvec{u}_{\theta }(\varvec{X})&=Q\left( \varvec{v}_{K}(\varvec{X})\right) , \end{aligned} \end{aligned}$$
(18)

where K is the number of sequential Fourier blocks.
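A hedged 1D sketch of a single Fourier block from Eq. (18), implemented with torch.fft; the channel width, the number of retained modes and the weight initialization are illustrative (the grid must contain at least twice as many points as retained modes for the truncation to make sense).

```python
import torch
import torch.nn as nn

class FourierBlock1d(nn.Module):
    """One Fourier block of Eq. (18): FFT -> linear transform R on the lowest `modes`
    frequencies -> inverse FFT, plus a pointwise linear bypass W and a GELU activation."""
    def __init__(self, width=32, modes=16):
        super().__init__()
        self.modes = modes
        self.R = nn.Parameter(torch.randn(width, width, modes, dtype=torch.cfloat) / width)
        self.W = nn.Conv1d(width, width, kernel_size=1)   # pointwise linear transform W

    def forward(self, v):                                  # v_t: (batch, width, n_grid)
        v_hat = torch.fft.rfft(v, dim=-1)                                 # step (i)
        out_hat = torch.zeros_like(v_hat)
        out_hat[:, :, :self.modes] = torch.einsum(                        # step (ii)
            "bim,iom->bom", v_hat[:, :, :self.modes], self.R)
        spectral = torch.fft.irfft(out_hat, n=v.size(-1), dim=-1)         # step (iii)
        return nn.functional.gelu(self.W(v) + spectral)                   # steps (iv) + (v)

block = FourierBlock1d()
v_next = block(torch.randn(4, 32, 128))    # one step v_t -> v_{t+1} on a 128-point grid
```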

Fig. 5

Schematic representation of the FNO (adapted from [6]). Layers with nonlinear activation (GELU) are marked in blue

Physics-informed neural operator (PINO) An extension of the FNO to a physics-informed neural operator (PINO) [13] has been investigated as well. The FNO computes displacements on an equidistant grid, so that a DEM-like extension is straightforward. Therefore, the same potential energy as for the DEM (see Eq. 10) is added to the loss function

$$\begin{aligned} F^{{\text {PINO}}}=F^{{\rm{FNO}}}+F^{{\rm{DEM}}} = F^{{\rm{FNO}}}+ \int _{B} \left( W\left( \varvec{F}_{\theta }\right) -\varvec{b} \cdot \varvec{u} \right) \, {\rm{d}} V-\int _{\Gamma _{N}} \overline{\varvec{T}} \cdot \varvec{u} \, {\rm{d}} A . \end{aligned}$$
(19)

4 Comparison methodology

The present section describes the methodology to compare neural FEM and neural operator methods. As there are many specific approaches in both classes and the characteristics of tasks in solid mechanics vary in a large span, the comparison is carried out by means of simple case studies that exhibit greatly varying mechanical challenges. Several 1D examples and one 2D example with two load cases serve as a basis for the analysis.

4.1 The 1D tensile bar

The first example is based on the setup shown in Fig. 6. The bar is clamped at the left edge \(u(-1)=0\) and loaded along its entire length with the force density f(X). A Neumann boundary condition \(P(1)=T\) is applied at the right edge.

Fig. 6

Illustration of the 1D BVP with BC: \(u(X=-1)=0\) and \(P(X=1)=T\)

Example A The following energy density is considered

$$\begin{aligned} W(F)=F^{\frac{3}{2}}-\frac{3}{2} F+\frac{1}{2} \quad { \text{ with } } \quad F=1+u^{\prime }(X). \end{aligned}$$
(20)

From Eq. (20), the first Piola–Kirchhoff stress reads

$$\begin{aligned} P=\frac{\partial W}{\partial F}=\frac{3}{2}\left( F^{\frac{1}{2}}-1\right) \Rightarrow -\frac{\partial P}{\partial X}=-\frac{3}{4} \frac{1}{\sqrt{1+u^{\prime }}} u^{\prime \prime }(X)=f(X). \end{aligned}$$
(21)

Specifically, the force density \(f(X)=X\) is chosen and the load at the free end is set to zero: \(T=0\)

$$\begin{aligned} -\frac{3}{4} \frac{1}{\sqrt{1+u^{\prime }}} u^{\prime \prime }(X)=X \quad { \text{ with } } \quad u(-1)=0, \,\, T=0 \Rightarrow u^{\prime }(1)=0. \end{aligned}$$
(22)

The example has the following analytical solution which will be used to validate the results obtained by the neural FEM methods

$$\begin{aligned} u(X)&=\frac{1}{135}\left( 3 X^{5}-40 X^{3}+105 X+68\right) \end{aligned}$$
(23)
$$\begin{aligned} u^{\prime }(X)&=\frac{1}{9}\left( X^{4}-8 X^{2}+7\right) . \end{aligned}$$
(24)

Example B A linear elastic material is investigated:

$$\begin{aligned} W=\frac{1}{2}\left( u^{\prime }\right) ^{2} \quad \Rightarrow \quad P(X)=u^{\prime }(X) \quad \Rightarrow \quad -u^{\prime \prime }(X)=f(X) \end{aligned}$$
(25)

Example B1

A single load case is analyzed in examples related to PINN and DEM.

$$\begin{aligned} -E u^{\prime \prime }(X)= f(X) = Q \cdot A \quad { \text{ with } } \quad u(-1)=0 \,\, {\text { and }} \,\, E u^{\prime }(1)=T. \end{aligned}$$
(26)

Here, distributed forces take the values \(Q = 9.395 \cdot 10^{4}\,{\rm{Nm}}^{-1}\) and \(T= 1.015 \cdot 10^{8}\,{\rm{Nm}}\). Young’s modulus corresponds to steel (\(E = 210 \cdot 10^{9}\,{\rm{Nm}}^{-2}\)) and the cross section surface is \(A = 1\,{\rm{m}}^2\).

Example B2 For the neural operator models, the PPDE is normalized and a parameterization of the force density f as well as a parameterization of the Neumann boundary condition are studied. Since the boundary only consists of one point, the boundary condition can be described by a scalar \(\pi _2\) for which a uniform distribution between [0, 1] is assumed. The reference solutions are computed using FEniCS. The BVP is described by

$$\begin{aligned} -\frac{\partial ^{2} u}{\partial X^{2}}=f(X) \quad { \text{ with } } \quad \left\{ \begin{array}{c} u(-1)=0 \\ u^{\prime }(1)=\pi _{2} \end{array}\right. . \end{aligned}$$
(27)

4.2 The plate—Example C

The selected two-dimensional example deals with a plate made of a Neo-Hookean material with the energy density

$$\begin{aligned} W(\varvec{F})=\frac{\mu }{2}\left( I_{1}-2-2\ln {J}\right) +\frac{\lambda }{2}(\ln {J})^{2}. \end{aligned}$$
(28)

\(I_{1}={\text {tr}}\left( \varvec{C}\right)\) is the first invariant of the right Cauchy–Green deformation tensor \(\varvec{C} = \varvec{F}^T \varvec{F}\) and \({J}={\text {det}}(\varvec{F})\) the determinant of the deformation gradient. The corresponding derivatives are \(\frac{\partial {J}}{\partial \varvec{F}}={J} \varvec{F}^{-\top }\) and \(\frac{\partial {\text {tr}}\left( \varvec{F}^{\top } \varvec{F}\right) }{\partial \varvec{F}}=2 \varvec{F}\). The symbols \(\lambda\) and \(\mu\) denote the Lamé constants.

For the energy density, Eq. (28), the 1st and 2nd Piola–Kirchhoff stress tensor are given by:

$$\begin{aligned} \varvec{P} =\frac{\partial W}{\partial \varvec{F}}=\mu \varvec{F}+(\lambda \ln {J}-\mu ) \varvec{F}^{-\top } \quad {\text { and }} \quad \varvec{S} =\varvec{F}^{-1} \cdot \varvec{P}=\mu \varvec{I}+(\lambda \ln {J}-\mu ) \varvec{C}^{-1}. \end{aligned}$$
(29)
Fig. 7

Geometry of the 2D BVP

The studied example is shown in Fig. 7. The plate is clamped at the left edge, and the Neumann boundary conditions are set at the right edge. The components of the kinematic fields in Cartesian coordinates are given by

$$\begin{aligned}{}[\varvec{u}]=\left[ \begin{array}{l} u_{x} \\ u_{y} \end{array}\right] \quad \Rightarrow \quad [\varvec{F}]=\left[ \begin{array}{ll} F_{x x} &{} F_{x y} \\ F_{y x} &{} F_{y y} \end{array}\right] =\left[ \begin{array}{cc} 1+u_{x, x} &{} u_{x, y} \\ u_{y, x} &{} 1+u_{y, y} \end{array}\right] . \end{aligned}$$
(30)

From Eq. (29), the first and second Piola–Kirchhoff stress tensor are calculated as follows

$$\begin{aligned} {[}\varvec{P}]&=\left[ \begin{array}{cc} P_{x x} &{} P_{x y} \\ P_{y x} &{} P_{y y} \end{array}\right] =\mu \left[ \begin{array}{cc} F_{x x} &{} F_{x y} \\ F_{y x} &{} F_{y y} \end{array}\right] +\frac{\lambda \ln {J}-\mu }{{J}}\left[ \begin{array}{cc} F_{y y} &{} -F_{y x} \\ -F_{x y} &{} F_{x x} \end{array}\right] \quad ,\end{aligned}$$
(31)
$$\begin{aligned} {[}\varvec{S}]&=\left[ \begin{array}{ll} S_{x x} &{} S_{x y} \\ S_{y x} &{} S_{y y} \end{array}\right] =\left[ \begin{array}{cc} \mu &{} 0 \\ 0 &{} \mu \end{array}\right] +\frac{\lambda \ln {J}-\mu }{{J}^{2}}\left[ \begin{array}{cc} C_{y y} &{} -C_{x y} \\ -C_{y x} &{} C_{x x} \end{array}\right] . \end{aligned}$$
(32)

Moreover, an equivalent stress is calculated as in [8] and used in contour plots (Sect. 5.1.2), \(S_{E}=\sqrt{0.5 \left( \left( S_{x x}-S_{y y}\right) ^{2}+S_{x x}^{2} + S_{y y}^{2}\right) +3 S_{x y}^{2}}\). Two load cases are investigated for the setup described.

Example C1 The first load case deals with the vertical load \(\overline{\varvec{T}}=-5 \varvec{e}_{y}\).

Example C2 The second load case is uniaxial tension with \(\overline{\varvec{T}}=50 \varvec{e}_{x}\).

4.3 Error measure

With the solution operator of the PPDE \(G: \mathcal {Y} \rightarrow \mathcal {S}\) and its NN approximation \(G_{\theta }\), the average relative \(L_{2}\) error for the N test data sets is calculated for the neural operator methods [6, 11, 46] as:

$$\begin{aligned} \varepsilon _{{\rm{rel}}} = \frac{1}{N} \sum _{j=1}^{N} \frac{||G_{\theta }(\varvec{y}_{j})-G(\varvec{y}_{j})||_{L_{2}}}{||G(\varvec{y}_{j})||_{L_{2}}}. \end{aligned}$$
(33)

With the neural FEM, only one concrete BVP can be analyzed at once. Then, the relative \(L_{2}\) error is calculated based on the solution for the displacement field

$$\begin{aligned} \varepsilon _{{\rm{rel}}}=\frac{||\varvec{u}_{\theta }-\varvec{u}||_{L_{2}}}{||\varvec{u}||_{L_{2}}}. \end{aligned}$$
(34)

In both cases, the determination of the relative \(L_{2}\)-error requires the computation of the \(L_{2}\)-norm which is approximated by the discrete \(L_{2}\)-norm. On an equidistant lattice \(\left\{ \varvec{X}_{i}^{{\text {equi}}}\right\} _{i=1}^{N}\), the discrete \(L_{2}\)-norm is calculated as

$$\begin{aligned} ||\varvec{f}||_{L_{2, d}}^{2}=\Delta V \sum _{i=1}^{N}||\varvec{f}(\varvec{X}_{i}^{{\rm{equi}}})||_{2}^{2}=\Delta V \sum _{i=1}^{N} \sum _{j=1}^{d} f_{j}^{2}(\varvec{X}_{i}^{{\rm{equi}}}), \end{aligned}$$
(35)

with the volume (in 2D: surface area) of each lattice unit \(\Delta V\).
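A small sketch of Eqs. (33)–(35), assuming predictions and references are given as tensors of shape (number of lattice points, d):

```python
import torch

def discrete_l2_norm(f, dV=1.0):
    """Discrete L2 norm of Eq. (35); f has shape (N, d) on an equidistant lattice."""
    return torch.sqrt(dV * (f ** 2).sum())

def relative_l2_error(u_pred, u_ref, dV=1.0):
    """Relative L2 error of Eq. (34); the lattice volume dV cancels out."""
    return discrete_l2_norm(u_pred - u_ref, dV) / discrete_l2_norm(u_ref, dV)

# For the operator methods, Eq. (33) averages this error over the N test realizations y_j.
```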

4.4 Numerical integration

The discrete \(L_{2}\)-norm is based on a simple Riemann sum with error order \(\mathcal {O}(\Delta V)\). For the approximation of the risk functional, e.g., in connection with the calculation of the potential energy in the DEM, other integration methods have to be considered. Two classical methods are the Monte Carlo (MC) integration and the trapezoidal rule. The trapezoidal rule requires partitioning of the integration domain into polytopes. In the simplest case, these are hypercubes on an equidistant grid. In [29], an integration method based on the Delaunay triangulation is proposed; the trapezoidal rule still applies, with \(\bar{f}_{i}\) now the average value over the i-th simplex (e.g., triangle). The two types of polytopes for integration in 2D are shown as examples in Fig. 8. Let V be the volume of the integration domain, \(\varvec{X}_{i}\) the sampling points, and \(\bar{f}_{i}\) the average of f over the corners of the i-th polytope with volume \(\Delta V_{i}\), where \(i \in [1, N]\) and N denotes the number of sampling points (MC) or polytopes (trapezoidal rule), respectively. Then, the integral approximations are given by Eq. (36a) for Monte Carlo and Eq. (36b) for the trapezoidal rule.

$$\begin{aligned} I_{{\rm{MC}}}(f) =\frac{V}{N} \sum _{i=1}^{N} f\left( \varvec{X}_{i}\right) \quad {\text {(a)}}{} & {} \qquad I_{{\rm{T}}}(f)=\sum _{i=1}^{N} \bar{f}_{i} \,\, \Delta V_{i} \quad {\text {(b)}}. \end{aligned}$$
(36)
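A 1D sketch of both estimates in Eq. (36); the test integrand and sampling are illustrative, and quasi-random LHS samples could, e.g., be generated with scipy.stats.qmc, which is not shown here.

```python
import torch

def mc_integral(f, X, V=2.0):
    """Monte Carlo estimate, Eq. (36a): X are (quasi-)random samples in a domain of volume V."""
    return V / X.numel() * f(X).sum()

def trapezoid_integral(f, X):
    """Trapezoidal rule, Eq. (36b): sorts X; the boundary points should be included in X."""
    X_sorted, _ = torch.sort(X)
    return torch.trapz(f(X_sorted), X_sorted)

f = lambda X: X ** 4                       # test integrand; exact integral over [-1, 1] is 2/5
X_equi = torch.linspace(-1.0, 1.0, 100)    # equidistant sampling (MC then reduces to a Riemann sum)
print(mc_integral(f, X_equi).item(), trapezoid_integral(f, X_equi).item())
```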
Fig. 8

Examples for polytopes

Three methods are investigated to select the grid points: equidistant grid points, pseudo-random numbers and quasi-random numbers (Latin hypercube sampling, LHS). As an example, we compare the absolute error of the potential energies in the nonlinear 1D setup (Example A). The trapezoidal rule with 100 000 grid points serves as a quasi-exact reference value. The MC integration with equidistant grid points reduces to a simple Riemann sum. The results for 100 and 1000 grid points, respectively, are summarized in Fig. 9. A characteristic distribution of grid points is shown in Fig. 10.

Fig. 9

Comparison of Monte Carlo (MC) integration and the trapezoidal rule with \(n=100\) and \(n=1000\) grid points for equidistant sampling (equi), pseudo-random uniform sampling (unif) and Latin hypercube sampling (LHS)

Fig. 10

Quasi-random (top), pseudo-random (bottom) and equidistant (middle) grid points for numerical integration (1D case)

Due to the larger integration error, uniform pseudo-random sampling is not considered further. In the following, "random" sampling always refers to quasi-random LHS. Figure 9 also shows that the trapezoidal rule is consistently more accurate than MC integration.

4.5 Technical implementation

In this work, we use PyTorch (version 1.11.0), which contains the L-BFGS optimizer that is employed in some of the investigated methods. The computations have been run on an Intel Core i5-7200U mobile processor. A mobile NVIDIA GeForce GTX 950M is used as graphics card.

The parameters of the NN are always randomly initialized, so that the results underlie statistical variations.

5 Results and discussion

The conventional PINN has been investigated by means of Examples A and C. Significant properties of cPINN are illustrated with Example A. The DEM has been executed on all examples (A, B and C). The neural operator models are tested on Example B2.

5.1 Neural FEM

5.1.1 PINN

Example A In the numerical experiments related to Example A, the NN architecture [1, 10, 1] is always used. Two optimizers (L-BFGS, SGD), different numbers of collocation points (100 and 1000 points) and different computational accuracies (single precision FP32 and double precision FP64) are compared (Fig. 11).

The SGD is run for 10,000 epochs and the L-BFGS for 15 epochs, each with the default parameters of the methods. The discrepancy in the required number of epochs is reflected by the run time, which is about 43 s for the SGD compared to about 0.300 s for the L-BFGS. Obviously, a second-order method (such as the L-BFGS) can greatly reduce the number of necessary iterations. In each epoch, the complete data set is used (Full-Batch). Despite the same information being provided to the NN, the L-BFGS method performs better on average than the SGD. Figure 11 also shows that the reduction in total error with the number of collocation points quickly saturates. The difference between \(N=100\) and \(N=1000\) is only about \(6.500 \cdot 10^{-6}\) for the L-BFGS.

Moreover, no significant increase of the total error measured in the relative \(L_{2}\) norm is observed when computing in single precision. This can greatly reduce the computation time on commercially available graphics cards that are optimized for single precision computing. However, this needs to be confirmed in further research for more complex problems.

The best results were obtained with the hyperbolic tangent (Tanh) activation function. Other activation functions, such as the rectified linear unit (ReLU) or the exponential linear unit (ELU), do not converge or converge very poorly toward the analytical solution of the problem. The calculated displacements for different activation functions are shown for comparison in Fig. 12. ReLU and ELU could not be optimized with L-BFGS. Therefore, only the results after optimization with Adam are shown. The second derivative of the approximation with ReLU activation is constantly zero everywhere (except for the point at the kink). This destroys the information in the residual, and training of the network must necessarily fail. Hence, for the following studies, Tanh was always applied as the activation function. Other activation functions are not considered in the present contribution. However, in the literature, the composite function \(\max \left( 0, x^{3}\right)\) [27], the Swish activation \(z S(\beta z)\) (where S denotes the sigmoid activation function) [48] and GELU [6] have been successfully employed.

The training of PINN with SGD took about 40 s for 10,000 epochs.

Fig. 11

Relative \(L_{2}\) error and run times in s for two optimization methods L-BFGS, SGD with 100 and 1000 collocation points and FP32/ FP64 accuracy

Fig. 12

Calculated displacements with 100 collocation points and different activation functions

Example C Among the conventional PINN representatives, the DCM is the easiest to implement for 2D problems and thus chosen to apply to Example C (Sect. 3.2.1).

Within the domain, the balance of linear momentum reads \(\nabla \cdot \varvec{P}=\textbf{0}\), which already is a residual form for approximations of \(\varvec{P}\). The Neumann boundary conditions are given by \(\varvec{P} \cdot \varvec{N}=\overline{\varvec{T}}\); the Dirichlet constraints are incorporated directly by applying a transformation to the output of the NN (Eq. (9)). The architecture of the network is specified as [2, 30, 30, 2]; 900 random collocation points are chosen on the Neumann boundary part and 4000 collocation points are used within the body. L-BFGS with learning rate 1.0 and line search with Wolfe condition is applied as optimizer.

Unlike the DEM, the DCM does not converge toward the reference solution for load case C1. A comparison with the DEM shows that the boundary conditions are not appropriately learned by the DCM.

Wang et al. [30] discuss that the training of PINNs may fail due to numerical inaccuracies if the contributions to the loss value—one portion from the residual and the other portion from the Neumann boundary part—or their gradients w.r.t. the NN parameters are of vastly different orders of magnitude. In the case of Example C, the loss value in the DCM consists of the portion from the residual with the value 0.119 and the portion from the Neumann boundary part with the value 1.121. The gradients of each contribution w.r.t. the NN parameters are relatively uniformly distributed (Fig. 13). Moreover, the correct boundary conditions are not learned even if the loss portion of the residual is excluded (manually set to 0).

Fig. 13

Histogram over the gradients of empirical risk. First hidden layer (left), output layer (right)

Hence, the failure of the DCM on this example cannot be explained by this kind of numerical inaccuracy. It is possible that the optimizer gets stuck in a local minimum, but this behavior needs further investigation.

5.1.2 cPINN

Example A For many applications, the accuracy that can be achieved with a classical PINN is not sufficient. cPINN was developed to improve the accuracy by avoiding the squaring of the residual [9, 49]. Instead, it employs adaptive competitive gradient descent (ACGD) as optimization procedure whose Python implementation is publicly available [50]. Furthermore, Tanh in the PINN part and ReLU in the discriminator network are chosen as the activation functions. The architecture of the PINN is chosen with 10 neurons in the hidden layer (architecture: [1, 10, 1] ) for comparability with the conventional PINN/ DCM. The layer width h of the discriminator, on the other hand, was varied (architecture: \(\left[ 1, 20, 2\right]\) vs. \(\left[ 1, 50, 2\right]\)). Initial experiments have shown that the output of the discriminator \(\varvec{d}_{\phi }\) must be separated for points in the domain and on the boundary \(\varvec{d}_{\phi }=\left( \varvec{d}_{\phi }^{B}, \varvec{d}_{\phi }^{\Gamma }\right)\).

Therefore, the output layer contains 2 neurons. The option of separating both outputs of the discriminator into independent subnetworks is also tested. However, this did not result in any improvement. Based on these results, only the first variant with the smaller number of NN parameters is considered further. All calculations are performed with double precision.

Fig. 14

Relative \(L_{2}\) errors and run times in s of cPINN with 100 and 1000 collocation points and 20 and 50 neurons in the hidden layer of the discriminator

Figure 14 shows the results for accuracy and run times, from 100 runs with random NN parameter initializations in form of a box plot. It can be seen that the accuracy is only moderately affected by the number of collocation points. However, the width of the discriminator has a significant impact on the training result. Increasing the number of neurons in the hidden layer from 20 to 50 reduces the error by an order of magnitude. It is not entirely clear why such a large discriminator network is necessary. Moreover, the training is relatively slow, taking about 2 min (up to 3 min for 1000 collocation points) for about 6000 epochs.

The improvement of up to 2 orders of magnitude reported in [9] could not be demonstrated here. This may be because the pathologies related to training PINNs [30, 31], which cPINN addresses, do not arise in this simple example. On the other hand, the regularization by the residual leads to a complex energy landscape of the optimization procedure, making optimization more difficult [31]. Moreover, the material law sometimes causes the optimization process to abort if the network parameters have been initialized unfavorably. This problem can be solved by reducing the range for random sampling of the initial parameter values.

In addition, the different weighting of the summands of the empirical risk can lead to different magnitudes of the gradients of the loss function w.r.t. the NN parameters, which can impair the NN parameter optimization [12, 30]. In the present case, the gradients for the residual within the domain \(\nabla _{\theta } F^{{\rm{PINN}}, B}\) are much larger than the gradients for the constraints \(\nabla _{\theta } F^{{\rm{PINN}}, \Gamma }\). The optimization procedure is therefore driven more strongly toward a solution that reduces the residual while allowing for deviations from the constraints. As a result, the optimization procedure converges toward a plausible solution, but one that does not satisfy the boundary conditions. For the complex architecture [1, 50, 50, 1], the gradients of the residual and boundary portion of the loss function w.r.t. the NN parameters of the first hidden layer and the output layer are shown in Fig. 15. However, not much discrepancy is detected in the order of magnitude of the gradient values for both portions of the loss function.

Fig. 15

Histogram of the gradients of the optimizable parameters. First hidden layer (left), output layer (right)

5.1.3 DEM

With the architecture [1, 10, 1], DEM training runs on average twice as fast as classical PINN training. For deeper networks, this effect is amplified, as will be shown in the analysis of Example C. Our own DEM implementation is based on the public source code [8] and is extended to allow training with Monte Carlo integration on quasi-random grid points. Tanh is used as the activation function.

Example A We enforced the geometric boundary conditions by the transformation

$$\begin{aligned} u(X)=(1+X) z_{\theta }(X), \end{aligned}$$
(37)

where \(z_{\theta }(X)\) is the output of the NN. This way, zero displacement is always obtained at the clamped end (\(X = -1\)). A comparison of the resulting relative errors in displacements and strains as well as the run times is shown in Fig. 16.

Fig. 16

Relative \(L_{2}\) errors and run times of DEM for trapezoidal rule (T) and Monte Carlo integration (MC) on 100 and 1000 grid points each

Two integration methods (Monte Carlo integration and trapezoidal rule) are compared, each on different sets of randomly selected support points (100 and 1000, respectively). For the trapezoidal rule, the support points are sorted and the boundary points are explicitly included. The optimizer is the L-BFGS with learning rate 1.0. This proves to be very efficient and approaches the solution after only 15 epochs.

The results furthermore illustrate that the use of single floating point accuracy (FP32) leads to only a slight decrease of accuracy, similar to what is seen with the conventional PINN. However, the run time even increases with FP32, which indicates slower convergence of the optimization procedure.

Example B The study of Example B reveals a pathology of the DEM that did not appear in Example A. Its effect can be seen in Fig. 17. This pathology can be attributed to overfitting [10], since the potential energy, unlike the squared residual, has no regularizing effect. In fact, early stopping significantly reduced the influence of overfitting. Alternatively, overfitting could be avoided by increasing the number of grid points. For 1000 grid points, hardly any overfitting occurred without a need for early stopping.

Fig. 17

Displacement results of DEM compared to FEM. Displacements with overfitting (left), Absolute error in displacements (right); rel. \(L_2\) error: \(1.358{\rm{e}}+00\)

Example C The approximation of the energy functional can be done analogously to the 1D example by means of different integration techniques. However, the Monte Carlo integration is the simplest option to implement.

The incorporation of the boundary conditions, the network architecture and the optimizer algorithm as well as the load case are chosen similarly to the DCM (Sect. 5.1.1). 10 000 collocation points within the body bulk are selected. The external energy is evaluated using 200 random points on the right edge of the plate. The results and errors for load case C1 are presented in Figs. 18 and 19, respectively.

Fig. 18

Calculated displacements and equivalent stresses for the vertical load case

The relative \(L_{2}\) error in the displacements is only 0.002 and in the equivalent stress 0.053. Figure 19 shows that the error in the equivalent stresses is concentrated at the restraint. The stress peaks at the critical points are not completely resolved by the NN.

Fig. 19

Absolute error in displacements and equivalent stresses for the vertical load case

The integration method does not cause this error, which is confirmed by a second calculation that uses the trapezoidal rule. The relative \(L_{2}\) errors in this case are 0.003 for displacements and 0.039 for equivalent stresses. Again, an equidistant grid with 10 000 collocation points is used. No significant improvement is obtained by the more accurate integration procedure, so the influence must be considered small.

For load case C2, the same procedure with MC integration is carried out. The results and errors are presented in Figs. 20 and 21, respectively. The relative \(L_{2}\) errors are 0.005 and 0.019 in this case.

In conclusion, the NN is able to approximate the character of the solution of the BVP. However, relatively large errors are found at the restraint. According to [29], the same problems arise for a PINN trained with the squared residual. The resolution of fine features of the stress and displacement fields still seems to be a challenge for future work.

Fig. 20

Calculated displacements and equivalent stresses for uniaxial tension

Fig. 21

Absolute error in the calculated displacements and equivalent stresses for uniaxial tension

5.1.4 Transfer learning (TF)

One possibility to enhance the performance of neural FEM is by means of transfer learning, e.g., in case of a varying Neumann boundary condition. This is illustrated on Example B2, where \(\pi _{2}\) is changed by only a small amount in each iteration. Then the NN trained on the previous \(\pi _{2}\) value is already a good approximation for its subsequent value. Hence, the NN parameter values can be copied to the new NN to reduce the number of required learning epochs.
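A minimal sketch of this warm-start strategy follows; the training routine is only a placeholder for one of the neural FEM trainings described above, and the parameter sweep and epoch numbers are illustrative.

```python
import copy
import torch
import torch.nn as nn

def make_net():
    return nn.Sequential(nn.Linear(1, 10), nn.Tanh(), nn.Linear(10, 1))

def train(model, pi_2, epochs):
    """Placeholder for a neural FEM training loop for a fixed Neumann parameter pi_2."""
    pass

pi_values = torch.linspace(0.0, 1.0, 50)
model = make_net()
train(model, pi_values[0].item(), epochs=200)       # full training for the first parameter value

for pi_2 in pi_values[1:]:
    new_model = make_net()
    new_model.load_state_dict(copy.deepcopy(model.state_dict()))   # copy the trained parameters
    train(new_model, pi_2.item(), epochs=50)        # warm start: far fewer epochs are required
    model = new_model
```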

Applied to Example B2 (linear elastic material), the average run time could be reduced by a factor of four. Similar time savings have been documented in [28] in the study of plastic deformations. The relative \(L_{2}\) errors are shown in Fig. 22. The average \(L_{2}\) error is \(3.145 \cdot 10^{-5}\), which is about an order of magnitude smaller than the error with DeepONet. The training duration is reduced from 23 s to about 5 s.

Fig. 22

Effect of transfer learning (TF). Comparison of \(L_{2}\) errors over the parameter space

5.1.5 Initialization of a conventional FE solver

As indicated in the literature and the results above, relative \(L_{2}\) errors of approx. \(10^{-4}\) are usually achieved. The computed solution could then be submitted as initialization to a traditional FEM solver in order to improve the accuracy.

Applied to Example A, a FEniCS calculation that is conventionally initialized with all displacements set to zero runs for four iterations. With the solution of the DEM (trapezoidal rule, 1000 collocation points) as initialization for the FEM solver, the calculation is accelerated by a factor of two: only half of the iterations are needed until convergence.

Thus, neural FEM results can be employed as a potential way to speed up an FEM simulation in settings where the neural FEM is not yet able to completely replace the FE simulation.

5.2 Neural operators

The operator methods are examined on Example B2 (Sect. 4.1)—a tensile bar with a clamping restraint at the left side and a free end at the right side. In this simple test case, analytical solutions are available to calculate the error for the parameterization of the Neumann boundary condition in the case of a fixed force density \(f(X) \equiv 1\). Combinations in which both the force field and the boundary condition vary simultaneously have not been carried out in the present work.

Training and test data for the varying force field are generated by means of Gaussian processes with squared exponential covariance (correlation length \(l=0.100\)). The free end of the bar at the right side yields the parameter \(\pi _{2} = 0\). For training, 1000 data sets and for the tests 100 data sets have been generated with FEniCS on an equidistant grid for \(X \in [-1,1]\) with 1024 grid points and quadratic shape functions. The relative \(L_{2}\) errors of the FEM simulation are several orders of magnitude smaller than expected from the NN methods (displacements approx. \(10^{-10}\); strains approx. \(10^{-7}\)) and should not influence the survey. One of the resulting data sets is illustrated in Fig. 23.
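A sketch of how such force fields could be sampled from a zero-mean Gaussian process with squared exponential covariance (correlation length 0.1); the jitter, the seed and the zero mean are assumptions for illustration, and the subsequent FEniCS solves are not shown.

```python
import numpy as np

def sample_force_fields(n_samples=1000, n_grid=1024, length_scale=0.1, seed=0):
    """Draw realizations of f(X) on an equidistant grid from a zero-mean Gaussian process
    with squared exponential covariance k(X, X') = exp(-0.5 (X - X')^2 / l^2)."""
    rng = np.random.default_rng(seed)
    X = np.linspace(-1.0, 1.0, n_grid)
    K = np.exp(-0.5 * (X[:, None] - X[None, :]) ** 2 / length_scale ** 2)
    L = np.linalg.cholesky(K + 1e-6 * np.eye(n_grid))            # jitter for numerical stability
    return X, (L @ rng.standard_normal((n_grid, n_samples))).T   # shape: (n_samples, n_grid)

X, forces = sample_force_fields(n_samples=5)   # e.g., five realizations for a quick check
```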

Fig. 23
figure 23

FEM solution for a random realization of the force density f

5.2.1 DeepONet and PIDeepONet

Numerical setup For the DeepONet, the data sets need to be preprocessed since the neural operator methods work with P random collocation points that change between the evaluations of the loss function, whereas the reference solutions are produced by FEM on a fixed mesh with m equidistant points. Hence, the realizations of the force fields are projected onto the FE mesh. Then, the FE results are interpolated and evaluated at the random collocation points, as sketched below. The parameterization of the Neumann boundary condition has only been investigated with the DeepONet, where \(\pi _{2} \sim U[0,1]\) is assumed.
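The following sketch illustrates this preprocessing for a single load case. The linear interpolation and the placeholder displacement field are simplifying assumptions; the actual implementation may use an interpolation consistent with the quadratic shape functions.

```python
import numpy as np

rng = np.random.default_rng(0)
X_mesh = np.linspace(-1.0, 1.0, 1024)      # fixed FE grid
u_fem = np.sin(np.pi * X_mesh)             # placeholder FE displacement field for one load case
P = 8                                      # random collocation points per load case
X_coll = rng.uniform(-1.0, 1.0, size=P)
u_coll = np.interp(X_coll, X_mesh, u_fem)  # training targets at the collocation points
```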

The source codes for DeepONet and PIDeepONet have both been published on GitHub [51, 52]. The architectures for the subnets were specified as [20, 100, 100] for the branch net and [1, 100, 100] for the trunk net. The branch net is thus set up with an input layer width of \(m=20\). The chosen activation functions are ReLU in the DeepONet and Tanh in the PIDeepONet. A similar architecture has also been suggested in [11, 12]. The L-BFGS optimizer with line search (strong Wolfe condition) and learning rate 1.0 is applied for 120 epochs, as suggested in [31]. However, L-BFGS is very memory consuming, so the history size is reduced to 50 (default: 100).
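A minimal PyTorch sketch of this architecture is given below; the exact layer composition in the published codes may differ, but the branch/trunk widths, the ReLU activation and the inner-product output correspond to the setup described here and reproduce the 22 500 trainable parameters mentioned in the comparison with the FNO (Sect. 5.2.2).

```python
import torch
import torch.nn as nn

class DeepONet(nn.Module):
    # Branch net [20, 100, 100] for the force field sampled at m = 20 sensor points,
    # trunk net [1, 100, 100] for the coordinate X; the output is their inner product.
    def __init__(self, m=20, width=100):
        super().__init__()
        self.branch = nn.Sequential(nn.Linear(m, width), nn.ReLU(), nn.Linear(width, width))
        self.trunk = nn.Sequential(nn.Linear(1, width), nn.ReLU(), nn.Linear(width, width))

    def forward(self, f_sensors, X):
        # f_sensors: (batch, m) sampled force field, X: (batch, 1) collocation point
        return (self.branch(f_sensors) * self.trunk(X)).sum(dim=-1, keepdim=True)

model = DeepONet()
optimizer = torch.optim.LBFGS(model.parameters(), lr=1.0, history_size=50,
                              line_search_fn="strong_wolfe")
u_pred = model(torch.randn(8, 20), torch.rand(8, 1))   # example evaluation
```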

1000 load cases are used for training and 100 for testing. In order to reduce the training effort, only 8 of the 1024 grid points are randomly chosen per load case. Hence, the training data set size reduces to 8000 data points. The error reduction gained by considering more grid points per load case quickly saturates, so this reduction is admissible. Approximately 50 random points can be estimated as the saturation limit for 1000 training data sets [11]. Training is conducted with both strategies, Full-Batch and Mini-Batch (batch size: 1000).

Since the PIDeepONet includes the whole information about the PPDE (similar to the PINN models) in the loss function, no reference solutions by means of FEM are necessary. In exchange, the loss function needs to be constructed anew for each PDE.

For the DeepONet, an alternative loss function is additionally investigated. It employs the relative \(L_{2}\) error instead of the mean square error (MSE, Eq. 5), as sketched below.
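The two loss variants can be written compactly as follows; the reduction over the load cases in a batch is an implementation assumption.

```python
import torch

def mse_loss(u_pred, u_ref):
    # mean square error over all collocation points and load cases (Eq. 5)
    return ((u_pred - u_ref) ** 2).mean()

def relative_l2_loss(u_pred, u_ref, eps=1e-12):
    # relative L2 error per load case, averaged over the batch
    num = torch.linalg.vector_norm(u_pred - u_ref, dim=-1)
    den = torch.linalg.vector_norm(u_ref, dim=-1) + eps
    return (num / den).mean()
```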

Results Representative results for the displacement u and strains \(u'\) over the bar length obtained with DeepONet and PIDeepONet are shown in Fig. 24. Both methods match the displacements relatively well, but DeepONet has visible deviations in the strains. In particular, the non-smooth curve of the strain, which is the spatial derivative of the displacement, can be attributed to the ReLU activation function, which has a discontinuous derivative.

Fig. 24
figure 24

Displacements and strains with corresponding errors for an exemplary element of the test data set with DeepONet

The errors for the displacements and strains are shown in Table 1. The use of the residual in the empirical risk of the PIDeepONet improves the accuracy in the strains by about one order of magnitude compared to the DeepONet. The effect of floating point accuracy on \(\varepsilon _{{\rm{rel}}}\) is small, similar to what was observed for the neural FEM. Overall, training with Full-Batch on FP32 performs best.

The run times of the models, each trained with Full-Batch and Mini-Batch (batch size 1000), are compared in Table 2. The run times for the DeepONet are significantly lower than for the FNO at comparable accuracies. Using the relative \(L_{2}\) error instead of the MSE reduces the convergence rate, requiring more iterations and increasing the run time. The difference between the mean and median is reduced, but no significant effect on the error is found.

Table 1 Mean error values over the test data set with DeepONet and PIDeepONet
Table 2 Run times of the optimization procedure with Full-Batch and Mini-Batch (1000 elements), respectively

Figure 25 shows the loss histories from the optimization with the DeepONet and the PIDeepONet, respectively. Overall, the relative \(L_2\) error on the test data set is smaller for Full-Batch training. The difference in resulting accuracy between the two methods can be attributed to the training error alone. The PIDeepONet converges significantly slower and yields worse accuracy than the DeepONet. With Full-Batch training, the PIDeepONet even converges to a local minimum instead of the global one. The poor convergence of the PIDeepONet demonstrates its significantly more complex optimization task, whereas the DeepONet benefits from the explicitly provided FEM results as training data.

Fig. 25
figure 25

Training loss histories after optimizing PIDeepONets with Full-Batch and Mini-Batch

Potential energy in the loss function The results of neural FEM (Sect. 5.1.2) and [48, 53] suggest that replacing the squared residual in the loss function of PIDeepONet by the potential energy can make the optimization problem easier to solve. Hence, such a method should be more robust and efficient and make training feasible even where training of PINNs fails. However, with 100 random realizations of the force field and 15 collocation points per realization as suggested in [48], this method does not converge for Example B2.

Influence of initialization The default initialization in PyTorch sets the weights and biases by randomly sampling from a uniform distribution. For \(\varvec{W} \in \mathbb {R}^{N_{k} \times N_{k-1}}\) and \(\varvec{b} \in \mathbb {R}^{N_{k}}\), it holds:

$$\begin{aligned} W_{i j}, b_{i} \sim U[-\sqrt{k}, \sqrt{k}] \quad { \text{ with } } \quad k=\frac{1}{N_{k-1}}, \end{aligned}$$
(38)

where \(N_{k}\) denotes the width of the k-th layer. Glorot and Bengio [54] and Wang et al. [12] suggest that the convergence of the NN can be accelerated by the Glorot initialization. Let \(N\left[ \mu , \sigma ^{2}\right]\) be a normal distribution with mean \(\mu = 0\) and variance \(\sigma ^{2}\). This yields

$$\begin{aligned} b_{i}=0 \quad { \text{ and } } \quad W_{i j} \sim N\left[ 0, \sigma ^{2}\right] \quad {\text { with }} \quad \sigma =\sqrt{\frac{2}{N_{k}+N_{k-1}}} \end{aligned}$$
(39)

for the parameters.
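In PyTorch, Eq. (39) corresponds to the Xavier (Glorot) normal initializer; a minimal sketch, assuming plain linear layers as used in the branch net:

```python
import torch.nn as nn

def glorot_init(module):
    # Eq. (39): zero biases and normally distributed weights
    # with variance 2 / (N_k + N_{k-1})
    if isinstance(module, nn.Linear):
        nn.init.xavier_normal_(module.weight)
        nn.init.zeros_(module.bias)

branch = nn.Sequential(nn.Linear(20, 100), nn.ReLU(), nn.Linear(100, 100))
branch.apply(glorot_init)
```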

The errors of the models with single precision, Full-Batch optimization and L-BFGS are shown in Table 3. The error of the DeepONet for the displacements becomes only slightly smaller. Also, no large difference in run times was observed. The Glorot initialization was developed mainly for deep learning applications and does not have much impact on the shallow networks (with one hidden layer) used here.

Table 3 Mean error values over the test data set, with Glorot initialization

Neumann boundary parameterization The (PI-)DeepONet and the FNO have been specifically designed for approximating mappings between function spaces. One advantage of the DeepONet over the FNO is that it can easily be applied to arbitrary parameterizations. For example, the PIDeepONet can be used to parameterize the Neumann boundary. Let the dimensionless problem again be given as

$$\begin{aligned} -u^{\prime \prime }(X)=1 \quad { \text{ with } } \quad u(-1)=0, \,\, u^{\prime }(1)=\pi _{2}. \end{aligned}$$
(40)

The analytical solution is given by

$$\begin{aligned} u(X)=\frac{3}{2}+X-\frac{1}{2} X^{2}+\pi _{2}(X+1). \end{aligned}$$
(41)
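For completeness, this expression follows from integrating Eq. (40) twice and determining the integration constants from the boundary conditions:

$$\begin{aligned} u^{\prime }(X)=-X+C_{1}, \quad u^{\prime }(1)=\pi _{2} \Rightarrow C_{1}=1+\pi _{2}, \qquad u(X)=-\tfrac{1}{2}X^{2}+C_{1}X+C_{2}, \quad u(-1)=0 \Rightarrow C_{2}=\tfrac{3}{2}+\pi _{2}. \end{aligned}$$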

In the following, only the scalar variable \(\pi _{2}\) needs to be varied. The solution operator \(G: I \rightarrow \mathcal {S}\) is sought, where \(I=[0,1]\) is fixed and \(\mathcal {S}\) denotes the space of admissible deformations.

For each of the subnetworks, a hidden layer with 50 neurons is used. Their architectures are thus given by [1, 50, 50]. Tanh is used as the activation function according to the experience in neural FEM. The NN is trained over 40 epochs with L-BFGS on 10 000 training data set entries and 1000 validation data set entries.

The data set is built by selecting a single random collocation point for each of the 100 realizations of \(\pi _{2}\) (\(P=1\)) and 1000 grid points each on the interval \([-1,1]\). No early stopping is used. The relative \(L_{2}\) error over the whole interval I is shown in Fig. 26.

Fig. 26
figure 26

Distribution of the relative \(L_{2}\) error over the parameter space for displacements u and strains \(\varepsilon\)

The training of the network took approximately 118 s. The mean relative \(L_{2}\) error is \(1.695 \cdot 10^{-4}\) for the displacements and \(2.463 \cdot 10^{-4}\) for the strains. For the calculation of the complete test data set, the PIDeepONet took 0.033 s. This highlights the difference between the short run time of the inference and the large computational effort for the training.

With L-BFGS and 1000 collocation points (to minimize the risk of overfitting), the training of the DEM on 100 realizations took about 23 s. The higher training effort of the operator model therefore pays off only for about 500 or more realizations of \(\pi _{2}\).

5.2.2 FNO and PINO

The FNO architecture proposed in [6] is applied to Example B2. Here, the hyperparameter representing the hidden layer width, \(d_v = 64\) (Sect. 3.3.2), results in 549 569 NN parameters. In addition, a smaller architecture with \(d_v = 12\) is set up, which results in 20 885 NN parameters. This is comparable to the DeepONet architecture with 22 500 parameters.

Padding was considered as suggested in [6], since the example at hand has non-periodic boundary conditions in the input functions.

Gaussian error linear unit (GELU) is used as the activation function,

$$\begin{aligned} {\text {GELU}}(x) = x \Phi (x)=\frac{x}{2}\left[ 1+{\text {erf}}\left( \frac{x}{\sqrt{2}}\right) \right] , \end{aligned}$$
(42)

where \(\Phi (x)\) denotes the cumulative distribution function of the standard normal distribution.

Adam with decreasing learning rate (initial value 0.001, reduction factor 0.5 every 50 epochs) and weight decay \(\lambda = 10^{-4}\) is used as the optimizer. The other optimizer parameters are kept at the PyTorch default values. The relative \(L_{2}\) error is used as the loss function. Similar to the DeepONet, a training data set with 1000 entries and a test data set with 100 entries are used; 500 training epochs are carried out.
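A minimal PyTorch sketch of this training configuration is given below. The placeholder model and the random data only stand in for the FNO and the FE reference solutions; optimizer, scheduler and loss follow the settings described above.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(1024, 64), nn.GELU(), nn.Linear(64, 1024))  # placeholder, not the FNO
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=50, gamma=0.5)

def relative_l2(pred, ref):
    # relative L2 error per load case, averaged over the batch
    return (torch.linalg.vector_norm(pred - ref, dim=-1)
            / torch.linalg.vector_norm(ref, dim=-1)).mean()

f = torch.randn(1000, 1024)   # dummy stand-ins for the 1000 training force fields
u = torch.randn(1000, 1024)   # and the corresponding FE displacement fields
for epoch in range(500):
    optimizer.zero_grad()
    loss = relative_l2(model(f), u)
    loss.backward()
    optimizer.step()
    scheduler.step()
```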

In the present work, the spatial derivatives in the loss value for the elastic strain energy and body forces are approximated by a second-order central difference method instead of employing the autograd feature. This can significantly reduce the computational effort, since the number of parameters in the NN is usually much greater than the number of grid points. Exact differentiation methods are discussed in more detail in [13].
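A sketch of such a difference stencil on the equidistant grid is shown below; the second-order one-sided stencils at the domain boundaries are an implementation assumption.

```python
import torch

def central_difference(u, dx):
    # u: (batch, n_grid) displacement fields on an equidistant grid with spacing dx.
    # Second-order central stencil in the interior, second-order one-sided stencils at the ends.
    interior = (u[:, 2:] - u[:, :-2]) / (2.0 * dx)
    left = (-3.0 * u[:, :1] + 4.0 * u[:, 1:2] - u[:, 2:3]) / (2.0 * dx)
    right = (3.0 * u[:, -1:] - 4.0 * u[:, -2:-1] + u[:, -3:-2]) / (2.0 * dx)
    return torch.cat([left, interior, right], dim=1)

# Example: strains from predicted displacements on the 1024-point grid over [-1, 1]
strains = central_difference(torch.rand(4, 1024), dx=2.0 / 1023)
```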

The results for an exemplary test load case are shown in Figs. 27 and 28. The absolute errors in the displacements computed by FNO and PINO are relatively similar, but the strains at the endpoints of the bar as computed by the pure FNO show significant errors, rendering the solution practically unusable. The mean errors in the displacements and strains over the whole test data set are shown in Table 4. The inclusion of the potential energy does not have a significant effect on the accuracy of the displacements, but the unphysical oscillations of the calculated strains at the edges of the computational domain are eliminated (Fig. 28). The only drawback of the PINO is the discretization dependence of the numerical derivative: after training, the error is no longer constant over different discretization levels, which is analyzed in Sect. 5.2.3.

Fig. 27
figure 27

Displacements calculated with FNO and PINO (for an exemplary load case of the test data set)

Fig. 28
figure 28

Strains calculated with FNO and PINO (for an exemplary load case of the test data set)

Table 4 Mean and median values for errors in displacements and strains from FNO and PINO with \(d_{v}=64\) and \(d_{v}=12\)

The errors are concentrated at the edges, which indicates that the choice of boundary conditions might be unfavorable for the FNO, although this approach should be able to handle non-periodic boundary conditions when the data arrays are padded with zeros. For comparison, a new data set with periodic force fields and a bar clamped on both sides has been modeled. For this purpose, the data set of the 1D Burgers problem is adopted [6], and the periodic initial conditions are interpreted as force fields. The results with the largest relative \(L_{2}\) error within the test data set are shown in Fig. 29 for the displacements and strains.

Fig. 29
figure 29

Displacements, strains and errors calculated with FNO for the Burgers data set (load case with highest \(L_2\) error within the test data set)

The poor agreement at the boundaries cannot be attributed solely to the non-periodic boundary conditions of the input functions: although the median of the \(\varepsilon _{{\rm{rel}}}\) error is lower for the periodic boundary conditions (\(2.396 \cdot 10^{-2}\)) than for the non-periodic ones (\(9.278 \cdot 10^{-2}\)), large deviations at the boundaries of the periodic domain are still present (Fig. 29). These errors can be reduced by regularization (for example by means of an energy functional), as observed with the neural FEM.

The average run times per epoch for computations on CPU and GPU are shown in Table 5. The midrange mobile graphics card in use accelerated the calculation by a factor of four for 32 bit floating point accuracy. For a comparable acceleration with FP64, dedicated high performance computing (HPC) accelerator cards are necessary instead of consumer graphics cards. We skip the calculation of the \(\varepsilon _{{\rm{rel}}}\) errors with double precision because no large influence on the total error can be expected, according to the results for the neural FEM.

Table 5 Average runtimes per epoch of FNO and PINO on FP32 and FP64 with \(d_{v}=64\) and \(d_{v}=12\)

5.2.3 Zero-shot super resolution

A main feature of neural operators is the consistency of the numerical error over different discretization levels. This is manifested by a nearly constant progression of the error across the discretization levels, as shown in Fig. 30. Therefore, zero-shot super resolution becomes possible, which means that the NN can be evaluated on a finer grid than the one used for training. In the present work, the NNs are trained on a data set based on FEM solutions on a grid with 1024 nodes, but can also be evaluated on finer discretization levels with the same error. The only exception is the PINO, which in the current contribution uses a finite difference method in the optimization process. The reference solution for the finer discretization task is obtained from an FEM analysis with 8192 nodes. This data set is cubically interpolated to all other discretizations for comparison with the results of the NNs.

Fig. 30
figure 30

Relative \(L_2\) error on a single instance of the test data set with varying discretization used for NN evaluation

6 Conclusions

In this work, different NN methods have been analyzed and applied to examples from elastostatics. Specifically, a 1D tensile bar with a hyperelastic material (Example A, Sect. 4.1) and with a linear elastic material (Example B, Sect. 4.1) has been investigated. Moreover, a plate made of a Neo-Hookean material has been studied for two load cases, vertical loading and uniaxial tension (Example C, Sect. 4.2).

6.1 Summary

A reliable estimate of the potential of NN-based methods for problems in elastostatics requires a comparison with conventional methods with regard to accuracy. Typically, the relative error of the routinely applied FEM amounts to \(10^{-6}\) or less in terms of strains, often limited only by the trade-off with computational effort and the floating point accuracy of the computing system. In the FEM, the influence of the mesh coarseness is examined separately in a convergence analysis, indicating that the number of nodes and integration points can be significantly lower than for the neural methods.

The numerical values obtained with the open source FEM code FEniCS for Example B are considered as the reference solution. It exhibits a relative MSE of \(7.325 \cdot 10^{-10}\) in terms of displacements and \(8.716 \cdot 10^{-7}\) in terms of strains.

6.1.1 Summary of PINN performance

Table 6 summarizes the results of the PINN-based methods for Example A. The combination of parameters and numerical subprocedures yielding the most accurate results is chosen for each method.

Table 6 Performance summary for PINN methods on Example A, FP64

In the basic form of the classical PINN [7], the empirical risk is built from the squared residuals of the differential operators. In various works [30, 31, 55], it has been shown that such a PINN is difficult or impossible to train even for simple examples, so alternative forms of regularization have been developed. The present work particularly studies the DEM, based on the principle of minimal potential energy, and the cPINN, based on game theory. The results for Example A using these three approaches (PINN, DEM, cPINN) are compared in Fig. 31. The average accuracies are relatively similar, but the comparatively long run time of the cPINN is disadvantageous.

Fig. 31
figure 31

Relative \(L_{2}\) errors and run times for Example A with the neural FEM approaches

According to [9], the relative errors can be reduced by up to two orders of magnitude by the cPINN, which could not be demonstrated with Example A in this work. The training of cPINN and DEM should converge in more cases than that of the pure PINN, i.e., it is more robust, as demonstrated with Example C. Overall, the PINN performs best in Example A. However, this example is not suitable to reveal the training pathologies of the PINN. Those pathologies were demonstrated only on Example C, where the training of the PINN fails, but the DEM can be applied successfully. The error measured in the \(L_{2}\) norm is relatively small, but the absolute errors of the equivalent stresses for the 2D plate in the vertical load case show deviations of up to \(77\,{\rm{Nm}}^{-2}\) at the restraint, for a maximum stress of about \(142\,{\rm{Nm}}^{-2}\).

Our results with Example C underline the suggestion from [29] that PINN and DEM as well as PIDeepONet in their present form are not able to resolve stress concentrations. Further work is necessary to find and analyze alternative approaches with improved accuracy and applicability for classical tasks in solid mechanics like the investigation of critical areas in strength analysis.

DEM In DEM, the convergence order of the integration method does not seem to be significant after reaching a certain limit of accuracy. However, DEM holds the risk of overfitting, which must be accounted for by early stopping or a sufficient number of collocation points (support points). This topic has not been addressed in the literature up to now.

An advantage of using the potential energy is the reduction of the order of differentiation, which also decreases the numerical effort. The run time is lower by about a factor of six compared to the PINN. All studies show a strong dependence of the result on the initialization of the NN parameters. In extreme cases, the optimization converges toward different functions. The relative \(L_2\) error of the DEM with trapezoidal rule and 1000 collocation points ranges from \(4.898 \cdot 10^{-6}\) up to \(4.277 \cdot 10^{-4}\) for the displacements and from \(3.774 \cdot 10^{-5}\) up to \(3.701 \cdot 10^{-3}\) for the strains. This indicates that the expensive training has to be conducted several times until an acceptable ML model is found.

6.1.2 Summary of neural operator performance

The results of the neural operator methods for Example B2 (Table 7) suggest that the physics-informed regularization leads to a reduction of the error in the strains by about one order of magnitude. The performance of the DeepONet-based methods exceeds that of their Fourier neural operator based counterparts significantly. This applies to the errors in the strains as well as to the run times.

Table 7 Performance summary for neural operator methods on Example B2 with FP32

The neural operator methods have been applied to Example B. All models are calculated with single precision floating point numbers, since the investigations of the DeepONet, the PIDeepONet and the neural FEM did not reveal significant effects of the precision on the accuracy of the results. Furthermore, the DeepONets have been optimized with Full-Batch training.

The analysis shows that the DeepONet is significantly faster than the FNO, even if calculated on a GPU. Both FNO and DeepONet can learn the solution operator of the parametric PDE, but the achieved accuracies are not sufficient in many cases.

6.2 Discussion of low accuracies of neural methods

All neural methods presented here show significantly reduced accuracies compared to the conventional FEM. However, the analysis performed in Sect. 5 indicates that the following effects do not crucially influence the accuracy:

  • Floating point accuracy,

  • Spatial integration methods,

  • Number of collocation points/size of data set,

  • Basic network and optimizer hyperparameters (number of layers, layer width, type of activation functions, learning rate, etc.),

  • Choice of optimizer,

  • Normalization of the problem.

Accordingly, the conceptual differences between FEM and DEM can be identified as the root cause for the difference in accuracy: The FEM is a method that originated from the need to solve complex elasticity and structural analysis problems. The essential transformation of the boundary value problem into its weak form relaxes the requirements on the differentiability of the solution function and of the shape functions at the element borders. In contrast, the neural methods use an NN to approximate the solution on the whole domain with a general technique. This results in an NN with a significantly higher number of optimizable parameters than the equivalent FEM setup.

In contrast to the adaptations of gradient descent used in the neural methods, the Newton–Raphson method in the FEM does not use a fixed step size or learning rate, but the optimal one. In the FEM, the calculation of the tangent stiffness matrix and the solution of the corresponding linear system are feasible, as there are fewer unknowns to solve for and the matrix is guaranteed to be sparse for large domains.

Lastly, NNs increase their expressivity by increasing the number of consecutive layers. In this way, the NN parameters contribute multiplicatively to the result, so that their influences become interdependent. Such a concept increases the complexity of the loss landscape significantly in comparison with the FEM. This issue is, for example, addressed in the formulation of the shallow energy method, which only uses a single hidden layer.

6.3 Limitations of the current investigation and future work

The work at hand only considers a number of representatives of the two classes of methods. According to the results shown, the accuracy of NNs has to be improved for a reliable engineering application in elastostatics. However, the most recent studies, not yet included in this contribution, open new perspectives for performance improvements.

One promising approach is a network architecture based on the neural attention mechanism, which is claimed to improve the accuracy of PINNs by up to two orders of magnitude [30]. A similar suggestion is made with regard to physics-augmented learning [55]. Sequential learning and curriculum learning are two further approaches that might improve the learning ability of PINNs [31]. Similarly, the improvements of the DEM [38], the Glorot initialization of the NN parameters [54] and pretraining [56] as well as the adaptations mDEM [29] and SEM [39] are strategies that could contribute significantly to increased accuracies and performances.

Based on the findings presented in Sect. 5, second-order methods for the optimization of the NN parameters during training [57, 58] should be investigated in future works. The example of L-BFGS shows that the optimization can be accelerated immensely by higher-order methods. They also profit from larger batches, which makes them more efficient in terms of data parallelism.

A more sophisticated verification of the neural methods could also require highly specialized testing setups, such that further studies should include an increased number of more complex and three-dimensional examples.