Introduction

Biological neural networks (BNNs) have inspired today’s most successful artificial neural networks (ANNs), which consist of neurons linked through connections known as synapses. Traditionally, each synapse in such a network serves three functions: (1) storage of long-term memories in its weight (W), (2) synaptic transmission - modeled as input-weight multiplication, and (3) long-term plasticity - the update of W during training.

However, these ANN synapses only capture a subset of the functionalities of biological ones. The latter follow complex biophysical dynamics and learning rules such as Hebbian plasticity1 and short-term plasticity2,3 (Fig. 1a). Additionally, higher-order plasticity rules exist that do not directly determine the synaptic weight, but rather the properties of the plasticity rule itself. One example is the control over the decay timescale of the short-term plasticity rule, which can range from milliseconds to minutes, depending on the neuronal activation2,4. These rules, known as meta-plasticity5,6, play a crucial role in demanding tasks that require not only learning but also learning-to-learn, i.e., meta-learning7,8,9,10,11,12.

Fig. 1: Biologically inspired synaptic functions and their memristor implementation.

a Organization of the mammalian brain with several biological neurons connected through synapses. When a postsynaptic spike (light blue) coincides with a presynaptic spike (light green), the corresponding synaptic coupling is strengthened (Hebbian plasticity) for a limited amount of time (short-term plasticity). This biophysical process is illustrated in the circular insets: (I) an influx of ions (e.g., Ca2+) through the postsynaptic voltage-gated ion channels leads to (II) an increased number of synaptic receptors, which increases the synaptic weight. (III) The weight subsequently decays back to its original value as the receptors gradually detach from the membrane. b Table comparing the synaptic functions of artificial synapses in standard ANNs (Artificial column) and biological synapses (Biological column). The plot on the right shows the weight of a biological synapse as a function of time. The short-term weight (F) is updated (ΔF) when the pre- and post-synaptic spikes coincide. Additionally, the decay time of F can be controlled, which corresponds to meta-plasticity. c Bio-inspired Short-Term Plasticity Neuron (STPN) model combining a conventional neuron model with short-term Hebbian (ST-Hebb) synapses. d Hardware implementation of a neuromorphic ST-Hebb synapse with a Cr/Pt-SrTiO3-Ti memristor. The device measurement on the right mirrors the biological functions of ST-Hebb synapses, combining memory and computation as well as long- (W) and short-term (F) dynamics.

The complexity of biophysical mechanisms in synapses and the corresponding plasticity rules are essential for nervous system function (e.g., refs. 13,14,15), but are missing in conventional ANNs. This limited biological realism might partly explain why artificial intelligence (AI) systems often underperform humans and animals in various respects, such as motor skills and adaptability to dynamic environments16. Moreover, today’s ANNs consume vast amounts of energy due to the large network size required for complex tasks17. For instance, training the large language model GPT3 consumed 1.287 GWh of electrical energy18, enough to power over 100 households for a year.

To address these issues, a more bio-inspired model for synapses was developed, incorporating short-term and Hebbian plasticity, as well as meta-plasticity19. Specifically, this model, known as the ST-Hebb synapse, not only performs the three functions of traditional ANN synapses mentioned above but also takes on additional roles (Fig. 1b): (4) storage of short-term memories (F) that decay over time, (5) short-term plasticity - the update of F (ΔF) during training and inference, and (6) meta-plasticity - the control over the decay time. To incorporate ST-Hebb synapses into a deep neural network (DNN), the short-term plasticity neuron (STPN) model has been proposed (Fig. 1c), combining a conventional neuron model with ST-Hebb synapses20. This model utilizes all six synaptic functions (1) to (6), incorporates meta-learning, can be integrated into multi-layer networks, and outperforms more conventional ANNs with less biologically realistic synapses in various challenging tasks.

The hardware of choice to run such neural networks is a parallel computing architecture such as the graphics processing unit (GPU). However, GPU-based implementations of multi-functional synapses suffer from the computational overhead caused by the aforementioned additional synaptic operations. This overhead is exacerbated by the large number of synapses that make up state-of-the-art neural networks, ranging from \(10^{6}\) to \(10^{14}\)21. On top of that, the operations governing ST-Hebb’s synaptic dynamics are memory bound and are thus negatively affected by the well-known von Neumann bottleneck imposed by physically separated memory and processing units22. These factors render the implementation of ST-Hebb synapses on GPUs inefficient, thus motivating the development of new hardware paradigms that are better suited to neural networks with multi-functional synapses.

Several promising neuromorphic architectures use memristors as hardware synapses because of their ability to collocate memory and computation in a single device, which circumvents the von Neumann bottleneck23. Memristors are two-terminal devices that can change their conductance state upon electrical24,25 or optical26,27 stimuli, similar to the change of the synaptic coupling (weight) upon a neuronal spike in biological systems. A growing body of research suggests that the rich internal dynamics of memristors can be leveraged to mimic biophysical processes taking place in synapses and neurons28,29.

There have been multiple demonstrations of bio-inspired hardware synapses realized using memristors with both long- and short-term dynamics30,31 that exhibit biological learning rules such as triplet spike-timing-dependent plasticity (triplet-STDP)32 or Bienenstock-Cooper-Munro (BCM)33. However, these demonstrations rely on spike-timing plasticity rules and therefore cannot be integrated into DNNs34, which limits their applicability. Meanwhile, a single-layer neural network that makes use of bio-inspired, multi-functional synapses was recently demonstrated on memristive hardware35. The authors showed the benefit of adding short-term synaptic plasticity during inference for a classification task in dynamically changing environments. Memtransistive devices were used as synapses. In addition to the two electrical contacts common to all memristors, they possess a gate analogous to transistors. To realize decaying traces, a voltage signal with the shape of the short-term decay was applied to this third (gate) contact. Short-term plasticity is therefore not an intrinsic property of these devices, i.e., the devices do not inherently exhibit short-term memory, but require an additional stimulus to do so. The need for three-terminal devices and precisely engineered voltage signals applied to each memtransistive synapse poses challenges for a large-scale implementation of such systems, because the required control circuitry and wiring would quickly become complex. Therefore, the introduction of a two-terminal memristive device that intrinsically encompasses all six synaptic roles (1–6) is key to enable scalable neuromorphic hardware that is not only energy-efficient, but also reaches or even surpasses the performance of conventional AI approaches.

In this work, we propose such a two-terminal memristive device that relies on the valence-change switching mechanism in SrTiO3 (STO)36 and intrinsically possesses the six operations needed to function as an ST-Hebb synapse. A symbolic representation on top of an SEM image of the fabricated nanoscale device is shown in Fig. 1d. The measured memristor conductance acts as the plastic synaptic weight and mirrors the behavior displayed in Fig. 1b. Specifically, our device can store two different states in its memory, (I) a state with slow dynamics (long-term weight W) and (II) a state with fast dynamics (short-term weight F), which are both encoded in the conductance of the memristor. In terms of computation, the four synaptic operations labeled 2, 3, 5 and 6 in Fig. 1b can all be performed by our STO devices: (III) Long-term plasticity (i.e., a change in the long-term weight W) and (IV) short-term plasticity (short-term weight update ΔF) can both be triggered by voltage pulses of different magnitudes. Notably, the short-term decay happens spontaneously, without the application of a complex signal. (V) Meta-plasticity (i.e., control over the decay time) can be achieved by applying a DC bias voltage to one of the two terminals, which limits the complexity of the control circuitry and wiring. (VI) Additionally, our devices provide the standard in-memory multiplication of the input (voltage U) by the synaptic weight (conductance G), realized by Ohm’s law I = G ⋅ U. They also exhibit low cycle-to-cycle variability due to their non-filamentary switching operation. As a consequence, the random displacement of a few atoms does not induce as much noise as in filamentary valence-change-type memristors37. Moreover, we can operate our devices at very low conductance values (tens of nS), which lowers the power consumption during operation. Their achievable short-term timescales range from 10 milliseconds to hundreds of seconds. Importantly, timescales on the order of 100 seconds are typically difficult to realize with nanoscale footprints using other neuromorphic approaches such as analog circuits, because the required capacitors demand much larger dimensions38,39,40.
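To make function (VI) concrete, the following minimal numpy sketch (our own illustration, not the authors' measurement code) shows how Ohm's law turns an array of such devices into an analog vector-matrix multiplier once the per-device currents are summed along shared column wires (Kirchhoff's current law); the conductance range is the one quoted above.

```python
import numpy as np

# Illustrative sketch: in-memory multiply via Ohm's law, I = G * U.
# Each device conductance encodes a weight; a shared column wire sums
# the per-device currents (Kirchhoff), yielding analog dot products.
rng = np.random.default_rng(0)
G = rng.uniform(12e-9, 23e-9, size=(4, 3))  # conductances in S (tens of nS, as measured)
U = np.array([0.6, 0.3, 0.0, 0.6])          # input voltages in V applied to the rows

I_device = G * U[:, None]        # Ohm's law for every device
I_column = I_device.sum(axis=0)  # column currents = weighted sums of the inputs
print(I_column)                  # one output current per column, in A
```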

To estimate the energy consumption of our multi-functional hardware synapses in the context of a large DNN, we introduce a modified STPN (m-STPN) unit that emulates parts of the device characteristics and fully incorporates the measured energy consumption of our devices. We then integrate this unit into the original STPN network simulator of ref. 20 to perform a complex reinforcement learning task in software with multi-functional synapses, namely learning to play Atari’s video game Pong. The Atari suite is a common benchmark for reinforcement learning and is chosen here as an exemplary task in a dynamic environment. We show that the m-STPN unit enables faster and more stable training than the original version for the task of Atari Pong. A major reason for this is the constraint on the short-term decay time constant imposed by our devices. Furthermore, we demonstrate that short-term weights with long timescales, such as the ones exhibited by our memristors, are required for robust and fast training of the network. Finally, we compare the network’s energy consumption for a pure GPU implementation of the synapses with the estimated energy consumed by our memristive synapses. We demonstrate an estimated gain in energy efficiency between 96× and 966×, depending on the GPU implementation.

Results and discussion

Multi-functional synaptic behavior in a single memristor

We fabricated a multi-functional memristive synapse on an STO single crystal substrate (Fig. 1d). We chose STO as the active material because it is a versatile and well-understood platform with rich internal dynamics due to the generation and movement of oxygen vacancy defects41,42, which can be tuned by, e.g., doping43,44,45, different electrode materials33,46 or interface engineering47. First, a high work function contact (Pt with a Cr adhesion layer beneath) was deposited. This step was followed by the fabrication of a Ti electrode with a Pt capping layer that prevents the Ti from oxidizing in air. Both contacts were deposited using electron beam evaporation and patterned by electron beam lithography with a subsequent lift-off process, resulting in a typical gap between the electrodes of roughly 40 nm. The devices were annealed at 300 °C for 20 min in flowing Ar, which causes a thermal oxide to form at the Ti-STO interface. The whole stack was finally covered with a uniform layer of 15 nm of SiN. The fabrication process is discussed in detail in Methods section “Device fabrication”.

Figure 2a shows 30 cycles of the I-V characteristics of our Cr/Pt-STO-Ti memristor (Fig. 2b). The voltage (−2 V to 2 V) is applied to the Pt electrode, while the Ti electrode is grounded. A high cycle-to-cycle repeatability as well as low conductance values (tens of nS) are obtained, which allows for energy-efficient device operation. The low conductance values and the counter-clockwise switching direction, as indicated by the black arrows, are attributed to a non-filamentary switching mechanism, which has already been reported for similar material stacks48. In this switching regime the conductance change is not caused by the formation of a filament made of oxygen vacancies (\({{{\rm{V}}}}_{{{\rm{O}}}}^{\bullet \bullet }\)) that bridges the two electrodes, but by the modulation of the Schottky barrier at the Pt-STO interface49. This modulation is attributed to the generation and recombination of \({{{\rm{V}}}}_{{{\rm{O}}}}^{\bullet \bullet }\)’s upon the application of an external voltage (bottom of Fig. 2b). The vacancies in turn locally dope the STO, which changes the height and width of the Schottky barrier, affecting the conductance. When a positive voltage is applied to the Pt contact, oxygen from the crystal (\({{{\rm{O}}}}_{{{\rm{O}}}}^{\times }\)) moves to the Pt-STO interface or into the porous Pt electrode, leaving behind a positively charged crystal defect42. This kind of n-type doping increases the conductance due to a decrease in the Schottky barrier height and width. Since \({{{\rm{V}}}}_{{{\rm{O}}}}^{\bullet \bullet }\)’s are mobile and positively charged, they migrate away from the Pt electrode along the applied electric field towards the Ti electrode, where they accumulate and potentially form a filament in a process called electroforming50. We observed that at high positive voltages (>4 V) we are able to electroform our device and put it in a filamentary-switching operation (Supplementary Section S2). This confirms the generation of \({{{\rm{V}}}}_{{{\rm{O}}}}^{\bullet \bullet }\)’s at positive voltages in our devices and allows us to distinguish the filamentary and non-filamentary regimes based on an analysis of the I-V characteristics.

Fig. 2: DC and dynamical behavior of multi-functional memristive synapses.

a Conductance vs. voltage characteristic (30 cycles) of the fabricated Cr/Pt-STO-Ti memristors. The black arrows indicate the counter-clockwise switching direction. b Sketch of the device stack and of the underlying switching mechanism. The two insets zoom into the Pt-STO interface at different applied voltages, showing the dynamics of interfacial oxygen ions (O) and oxygen vacancies (\({{{\rm{V}}}}_{{{\rm{O}}}}^{\bullet \bullet }\)): (Vapp > 0) At positive voltages, \({{{\rm{V}}}}_{{{\rm{O}}}}^{\bullet \bullet }\) formation and migration occur. The negatively charged oxygen migrates towards the interface and into the porous Pt electrode, while the positively charged \({{{\rm{V}}}}_{{{\rm{O}}}}^{\bullet \bullet }\) move along the electric field away from the Pt electrode and towards the grounded Ti electrode. (Vapp ≤ 0) At zero applied voltage, \({{{\rm{V}}}}_{{{\rm{O}}}}^{\bullet \bullet }\)'s move back towards the Pt, driven by the built-in electrochemical gradient, where they get filled by oxygen. A negative voltage accelerates this process. c Conductance change from low to high under the application of 100 SET pulses with an amplitude of 4 V and a duration of 500 μs. d Time-dependent conductance measurement (read out at 0.6 V) when voltage pulses (2 V, 2.5 V, and 3 V) with a duration of 100 μs are applied. The pulses induce short-term increases of the conductance with subsequent decay. The long-term conductance (red area) remains constant. e Short-term conductance changes due to the voltage pulses in the dotted rectangle of (d). Only the conductance values during the read voltage are shown here. f Aggregate plot showing short-term plasticity for different values of the long-term weight W. The measurement data were obtained by first applying the protocol in (d) to characterize the short-term plasticity for the minimum long-term weight (W1). The long-term weight was then changed by 100 SET pulses (c) and, after a waiting period of 240 s, the short-term plasticity was measured again.

The number of vacancies generated, as well as the distance over which the \({{{\rm{V}}}}_{{{\rm{O}}}}^{\bullet \bullet }\)’s migrate from the Pt contact, depends on the voltage and duration of the applied electrical signal51. Long electrical pulses at high voltages are expected to lead to a high vacancy concentration extending far away from the Pt electrode, whereas short, low-voltage pulses result in a relatively small vacancy concentration close to the Pt. After these pulses the generated vacancies migrate back towards the Pt contact without an external voltage, driven by a gradient in electrochemical potential41, and are filled there by the interfacial oxygen52. In addition to the incorporation of molecular oxygen from the porous Pt electrode, atmospheric water vapor can also lead to the filling of oxygen vacancies by incorporating oxygen from water molecules into STO53. Through such processes vacancies start disappearing from the vicinity of the Pt contact, forming a growing, vacancy-free region. Since the Schottky barrier is mainly sensitive to the vacancy concentration immediately adjacent to the Pt electrode, even small vacancy movements in this region significantly change the contact resistance and thus the overall device conductance42, explaining the observed conductance decay in our memristors. Furthermore, the vacancies close to the Pt are annihilated first, on timescales of minutes (short-term), while the vacancies further away require an increasingly long time to migrate back, resulting in timescales of multiple hours (long-term), as described in ref. 41. Therein, this slowdown is attributed to the built-in electric field at the Pt-STO interface, which decreases monotonically with the distance from the Pt electrode. It is also likely that the oxygen incorporation kinetics at the Pt-STO interface play a role in determining the short-term decay timescale52.

Additionally, the back-migration flux of \({{{\rm{V}}}}_{{{\rm{O}}}}^{\bullet \bullet }\) and the subsequent vacancy filling at the Pt-STO interface can be increased by the application of a negative bias, leading to a faster conductance decay. Hence, the decay time can be voltage-controlled. A summary of the postulated physical mechanisms and how they underlie the synaptic functions in Fig. 1b is given in Supplementary Section S6. Even though this physical picture supports our experimental observations, it cannot be excluded that other effects play an important role in the switching process, such as interface trap states54 or protonic conduction, which is well studied in oxide-based memristors53,55,56,57. Further investigations will be needed to unequivocally determine the physical mechanism(s) at the origin of our devices’ behavior.

In our approach, the memristor’s conductance implements the synaptic weight, whose dynamics (long- and short-term) are crucial in ST-Hebb synapses (Fig. 1b). To investigate the conductance dynamics of our STO memristors we apply pulses of different voltages and widths to them (Fig. 2c, d). We first induce long-term plasticity (function 3 in Fig. 1b) by applying 100 SET pulses with an amplitude of 4 V and a duration of 500 μs that cause the device to switch from a low to a high conductance state (Fig. 2c). This high conductance state slowly decays over thousands of seconds without applied bias (Supplementary Section S3). After the SET procedure we leave the device at 0 V for 240 s (not shown) to let it settle to a stable state. We then proceed with measuring the conductance of the device at 0.6 V for 375 s (Fig. 2d), during which 100 μs-long pulses with voltages of 2, 2.5, and 3 V are applied. The long-term conductance induced by the SET pulses remains largely constant for the time period of the measurement. The 100 μs-long pulses lead to a short-term conductance increase, i.e., short-term plasticity (function 5 in Fig. 1b), whose magnitude depends on the pulse voltage (3, 5 and 10 nS for 2, 2.5 and 3 V, respectively) and is followed by a decay. This can be observed in Fig. 2e, where the conductance during the three last pulses of the protocol (dotted rectangle in Fig. 2d) is plotted. The conductance during the read voltage is shown, omitting the values during the 100 μs-long pulse. In Fig. 2f the long- and short-term components of the conductance (functions 1 and 4 in Fig. 1b) are visualized for six measurements with different values of the long-term weight W. The measurement data were obtained by repeating the protocol of Fig. 2c, d multiple times, i.e., first setting the long-term weight (W1, W2, ...) by 100 SET pulses, waiting for 240 s, and then applying the short-term pulse protocol of Fig. 2d. Values of the long-term conductance in the range of 12 to 23 nS can be set in this way (long-term plasticity). These conductance values can further undergo short-term increases induced by voltage pulses (short-term plasticity). The obtained collocation of both long- and short-term plasticity motivates the use of these devices as ST-Hebb synapses.

The short-term plasticity is investigated in more detail in Fig. 3a, which displays the mean (solid line) and standard deviation (shaded area) of five measurements. Pulse-induced short-term conductance updates (ΔF) and subsequent decays are obtained using four different voltage amplitudes (2, 2.5, 3, and 3.5 V). The pulse width was fixed to 100 μs and the read voltage to 0.6 V. The conductance values were normalized by subtracting the initial conductance at t = 0 from the data. We observe low cycle-to-cycle variability, in agreement with the I-V characteristics in Fig. 2a. The same measurement was repeated for two additional pulse widths (20 and 500 μs). The resulting ΔF’s are reported in Fig. 3b as a function of the pulse amplitude and width. It can be seen that ΔF values in the range of 0.7 to 38.6 nS can be achieved by adjusting these parameters. The corresponding energy per pulse is given in Fig. 3c for the same pulse voltage and width combinations. The details of the energy calculations are given in Supplementary Section S7, and a measurement with 200 pulse cycles is given and discussed in Supplementary Section S4.

Fig. 3: Control over the magnitude and dynamics of short-term conductance updates.

a Mean (solid line) and standard deviation (shaded area) of pulse-induced short-term conductance updates (ΔF) from five conductance measurements and using four different pulse voltages (2, 2.5, 3, and 3.5 V). The read voltage is set to 0.6 V and the pulse width to 100 μs. To better compare the measurements, the conductance values were adjusted by subtracting the initial conductance at t = 0 from the data. b Heatmap of the achieved ΔF for the different pulse voltages and widths. c Heatmap of the required pulse energy for the same voltage and width combinations as in (b). d (Top) Applied voltage protocol on a linear x-axis using 0.6 V/200 μs read pulses with a period of 700 μs and (bottom) corresponding conductance values shown on a logarithmic x-axis. In between the read pulses a constant bias voltage (Vbias) of variable amplitude is applied. The main pulse voltage and width are set to 3.5 V and 500 μs, respectively, in all measurements. The mean and standard deviation of the adjusted conductance values are shown for 5 measurements on a semi-log plot. e Extracted decay time constant Λ from the measurements in (d) as a function of Vbias. The experimental data points were fitted with a sigmoid function.

Besides the magnitude of the conductance increase, it is also possible to control the subsequent decay using a DC bias voltage (Vbias) that is constantly applied during the experiment (Fig. 3d), effectively implementing meta-plasticity (function 6 in Fig. 1b). The mean and standard deviation of the conductance for five measurements are shown as a function of time. The voltage pulse that triggers the conductance increase is the same in all cases (3.5 V / 500 μs), thus resulting in similar ΔF, whereas the bias voltage is varied (see Supplementary Section S9 for details). The timescale of the decay increases with increasing Vbias from hundreds of ms (Vbias = −0.6 V) to tens of seconds (Vbias = 0.6 V). To quantify the resulting decay time constant (Λ) as a function of the bias voltage, we fitted an exponential to the measured curves (Supplementary Section S9). In our fit, the maximum value of Λ = 1 indicates no decay and the minimum value (Λ = 0) corresponds to immediate decay. Similar measurements were performed on other devices to qualitatively assess device-to-device variability (Supplementary Section S5). Figure 3e demonstrates that we can experimentally control Λ over a range from 0.08 to 0.92 as a function of the applied Vbias. The relationship between Vbias and Λ is modeled by a sigmoid function \(\Lambda ({V}_{bias})=\frac{L}{1+\exp (-k\cdot ({V}_{bias}-{V}_{0}))}+{\Lambda }_{0}\), where L, k, V0, and Λ0 are fitting parameters.
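For illustration, such a sigmoid fit can be reproduced in a few lines of Python with scipy. In the sketch below, only the endpoint values (Λ = 0.08 at Vbias = −0.6 V and Λ = 0.92 at Vbias = 0.6 V) are taken from the text; the intermediate data points are hypothetical placeholders standing in for the measured curve.

```python
import numpy as np
from scipy.optimize import curve_fit

def lam_sigmoid(v_bias, L, k, v0, lam0):
    """Sigmoid model from the text: Lambda(V_bias) = L / (1 + exp(-k (V_bias - V0))) + Lambda0."""
    return L / (1.0 + np.exp(-k * (v_bias - v0))) + lam0

# Endpoints (0.08 at -0.6 V, 0.92 at +0.6 V) are from the text; the
# intermediate points are illustrative placeholders, not measured data.
v_bias = np.array([-0.6, -0.3, 0.0, 0.3, 0.6])
lam = np.array([0.08, 0.25, 0.55, 0.80, 0.92])

popt, _ = curve_fit(lam_sigmoid, v_bias, lam, p0=[0.9, 5.0, 0.0, 0.05])
L, k, v0, lam0 = popt
print(f"L = {L:.2f}, k = {k:.2f} 1/V, V0 = {v0:.2f} V, Lambda0 = {lam0:.2f}")
```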

In summary, the following functions are performed intrinsically by our memristors: storing both (1) long- (W) and (4) short-term (F(t)) weights (Fig. 2f), (3) long-term plasticity (Fig. 2c), (5) short-term plasticity (Fig. 3a, b), (6) meta-plasticity via control over the decay time parameter Λ (Fig. 3d, e), and (2) multiplication of the input voltage with the synaptic weight according to Ohm’s law.

DNN with multi-functional memristive synapses

The six intrinsic functionalities of our memristors can be utilized by ST-Hebb synapses in a deep STPN network. Such networks have been shown to outperform traditional DNN implementations without multi-functional synapses at a variety of complex tasks in dynamic environments20. One such dynamic task is learning to play Atari Pong, a video game and common machine learning benchmark. In Pong a player (the STPN network) confronts an opponent, each manipulating a vertically movable bar to strike a ball, aiming to get the ball past the opponent’s bar (i.e., scoring a point) or preventing the opponent from doing so. The game concludes when either player has scored 21 points. The STPN network’s reward is the difference between the player’s and the opponent’s points at the end of the game. Given only this scalar reward as input, the network finds a strategy that results in the maximum score of 21 by repeatedly playing the game and employing reinforcement learning, a bio-inspired learning paradigm58. Below we introduce modified STPN units (m-STPNs), which are STPNs20 with a modified weight normalization scheme (see Methods section “Modified STPN model” for details and benefits of this approach). These units make use of our multi-functional synapses to play Atari Pong. Through simulation, we estimate the energy consumption of the whole network if it were run on our memristive hardware and compare it to a pure GPU implementation.

Modified short-term plasticity neuron

The deep STPN network simulator investigated here (Fig. 4a) employs a network layer consisting of our modified STPN units (m-STPN layer). The network itself relies on an actor-critic architecture that takes frames of the Atari Pong environment as inputs and computes both the next action to take in the environment (actor) and an estimate of the value of the current state (critic). The frames are first processed by two convolutional layers into a dense feature set that forms the input for the m-STPN layer. The latter consists of 64 m-STPN units, each of which is connected through ST-Hebb synapses to the 2592 inputs as well as recurrently to the 64 outputs. In total, this amounts to (2592 + 64) ⋅ 64 = 169984 synapses. The output of the m-STPN layer is then fed into two fully-connected linear layers that compute the next action (the actor’s next step to take in the game) and the current value (how advantageous the current game state is). To compare the influence of the STPN implementation on the training performance, three networks with different STPN layers (m-STPN, STPN, and no plasticity) were investigated (Fig. 4b). Here, the reward during training is plotted as a function of the steps taken by the actor (see Methods section “Network training” for details). Each curve represents the average reward of 16 agents that learn to play the game with different randomly initialized parameters. We observe that in terms of training speed both m-STPN and STPN outperform the no-plasticity implementation (i.e., a traditional recurrent layer without time-dependent synaptic weights). Furthermore, the m-STPN version learns slightly faster than the original STPN network, while also exhibiting a much smaller standard deviation among different training runs (shaded areas in Fig. 4b). While the robust training performance of the m-STPN layer is encouraging, the main aim of our m-STPNs is to show that our multi-functional memristors can act as hardware ST-Hebb synapses in the STPN network of Fig. 4a. To achieve this, the following device characteristics were implemented into the m-STPN units: (1) mapping of the memristor conductance (Gmeas) to the simulated, unitless synaptic weight (G) by the linear relationship

$$G=({G}_{meas}-{G}_{min})/m$$
(1)

with m = 2 nS and Gmin = 12 nS. (2) Adding a discretization operation to the simulated short-term weight update (ΔF) that limits the number of ΔF values (states) to an amount that can be resolved by our memristors. To satisfy this requirement, the conductance values corresponding to two adjacent states should be separated by at least one standard deviation, which is below 1 nS for all short-term weight updates ΔFmeas (max.  ± 0.9 nS in Fig. 3b). We therefore chose a discretization step of 1 nS for ΔFmeas, which translates to a step of 0.5 for the simulated ΔF according to Eq. (1). (3) Fixing the maximum of ∣ΔF∣ to 20, which ensures that the weight update remains in a range that is achievable by the STO memristors. A histogram of ΔF for all synapses during an entire Pong game, with and without non-idealities, is given in Supplementary Section S11. (4) Limiting the range of the decay time constant Λ to values that can be reached by our devices ([0.08, 0.92]). Furthermore, it was observed that constraining Λ also affects the training performance of the network, as shown in Fig. 4c. The five lines denote different constraints imposed on the learned decay time parameter Λ. Notably, it is beneficial to incorporate synapses with large decay time constants during training: the larger the upper limit of Λ, the faster the reward increases. Unexpectedly, the case with Λ = 0 (i.e., immediate decay of the short-term weight changes for all synapses) also learns, albeit more slowly and less robustly, as can be seen from the larger standard deviation compared to Λ = [0.08, 0.92] (inset of Fig. 4c). The longer, constrained decay times were made possible by the modified weight normalization scheme in m-STPNs (Methods section “Modified STPN model”). Because the decay constant Λ is naturally limited in our devices, destabilizing phenomena such as an exponential gain (Λ > 1) instead of a decay (Λ < 1) are automatically prevented. Also note that non-volatile memristive devices, which correspond to Λ = 1, are insufficient for the implementation of synapses in STPN networks (Supplementary Fig. S15). The code sketch below summarizes these four constraints.
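A minimal Python sketch of constraints (1) to (4), assuming PyTorch as in the STPN simulator; the helper names are ours, while the constants come from the text and Eq. (1).

```python
import torch

M_NS, G_MIN_NS = 2.0, 12.0       # Eq. (1): G = (G_meas - G_min) / m
DF_STEP = 0.5                    # (2) 1 nS device resolution -> 0.5 simulated units
DF_MAX = 20.0                    # (3) maximum |dF| reachable by the STO devices
LAM_MIN, LAM_MAX = 0.08, 0.92    # (4) achievable decay-constant range

def weight_from_conductance(g_meas_ns: torch.Tensor) -> torch.Tensor:
    """(1) Map a measured conductance (in nS) to the unitless simulated weight."""
    return (g_meas_ns - G_MIN_NS) / M_NS

def constrain_dF(dF: torch.Tensor) -> torch.Tensor:
    """(2) Discretize the short-term update, then (3) clamp it to the device range."""
    dF = torch.round(dF / DF_STEP) * DF_STEP
    return dF.clamp(-DF_MAX, DF_MAX)

def constrain_lambda(lam: torch.Tensor) -> torch.Tensor:
    """(4) Keep the learned decay constant inside the device range [0.08, 0.92]."""
    return lam.clamp(LAM_MIN, LAM_MAX)
```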

Fig. 4: Simulation and energy consumption of an STPN network with multi-functional synapses.

a Sketch of the full STPN network. A frame of the Atari game is fed into two convolutional layers: Conv(kernel, stride) plus a ReLU activation function. The features are then fed into the m-STPN layer (blue ellipses and lines). The layer’s output is split into actions and a value by two fully connected linear layers. b, c Average reward as a function of agent steps during training for (b) three different implementations of the STPN layer and (c) five different ranges of Λ. Each curve represents the average reward of 16 agents with different random parameter initialization. The shaded area denotes the standard deviation. In the inset of (c), the cases Λ = 0 and Λ = [0.08, 0.92] (i.e., the achievable device range) are shown. d Total synaptic weight (long- and short-term component) of a single synapse of the trained network (\({S}_{max\Delta F}\)) during an entire game. The zoom-in additionally shows the long-term weight W in red and the ΔF as black bars. e Energy consumed by our memristors due to ΔF updates, i.e., voltage pulses with widths wp, fitted by a power law. f Power consumed by our memristors due to different decay bias voltages (Vbias). g Time evolution of the energy consumption of the synapse in (d) during an entire Pong game for a memristor (blue) and a pure GPU implementation (orange). Different energy contributions and the total energy are shown. h Histograms of all synapses in the network, indicating how many synapses consume a specific amount of energy during the whole Pong game. The two contributions, ΔF and Decay, are shown. For the Decay, the worst-case scenario Vbias = 0.6 V is assumed for all synapses. i Total energy histogram (ΔF plus Decay).

After training, some of the 16 trained agents achieve the maximum reward of 21 (Supplementary Section S12). The total synaptic weight G = W + F of a single synapse of such a trained agent is reported in Fig. 4d over the course of an entire game, which lasts roughly 50 seconds. This specific synapse was chosen because it exhibits the largest synaptic changes (ΔF) in the whole network. It is therefore referred to as \({S}_{max\Delta F}\) in the remainder of the text and serves as a representative example of the behavior of a synapse in an STPN network. It is observed that the value of the synapse’s weight G changes over time due to the short-term plasticity of ST-Hebb synapses. Importantly, the short-term updates are sparse, which makes the implementation of this reinforcement learning task energy efficient on our memristive hardware, as only a small number of energy-consuming short-term weight updates (ΔF) are needed. The zoom-in additionally shows both the long-term weight component W (in red) and the short-term weight updates ΔF (in black). Each simulation timestep is marked by a dot.

Energy consumption of deep STPN network

Next, we estimate the energy consumption of synapse \({S}_{max\Delta F}\) for the duration of the entire game if it were implemented on our memristor. Two sources of energy loss are considered: firstly, each voltage pulse that causes a short-term weight update consumes energy (Epulse) (Fig. 3c). Secondly, due to the application of a constant bias to control the decay time, a small current continuously flows through the devices, inducing a power loss (Pbias). We address these two components separately. Figure 4e reports the first one (Epulse) as a function of the short-term weight updates ΔF. This quantity is extracted from the measurement data in Fig. 3b, c for different pulse widths (wp). The measured energy data points closely follow a power law: \({E}_{pulse}(\Delta F)=c\cdot {(\Delta F)}^{\alpha }\) with c = 30 pJ and α = 1.52. This power law was incorporated into our neural network simulator to estimate the energy consumption of the short-term weight updates in our memristors. Because the value of ∣ΔF∣ is limited to 20 and because the weight updates are sparse, this first contribution to the energy consumption remains low. In Fig. 4f the second contribution to the energy consumption (Pbias) is given as a function of the total synaptic weight G. It is calculated according to \({P}_{bias}=| {G}_{meas}| \cdot {V}_{bias}^{2}\). Note that even for a simulated weight of G = 0 there is a remnant power draw (except if Vbias = 0) because of the finite minimum conductance value Gmin = 12 nS of the physical devices. For the maximum bias voltage Vbias = 0.6 V, the power consumed by a synapse with a constant weight of G = 0 is therefore 4.3 nW. This low power consumption is a direct consequence of our memristors’ low conductance values, enabled by their non-filamentary switching behavior.
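Combining the two fitted relations gives a simple per-synapse energy model. The following sketch (the helper names are ours) reproduces the 4.3 nW remnant bias draw quoted above and evaluates the pulse energy for the largest allowed update.

```python
import numpy as np

C_PJ, ALPHA = 30.0, 1.52             # fitted power law: E_pulse(dF) = c * dF**alpha

def pulse_energy_pj(dF):
    """Energy (pJ) of one short-term update of magnitude |dF| (simulated units)."""
    return C_PJ * np.abs(dF) ** ALPHA

def bias_power_nw(G, v_bias=0.6, m_ns=2.0, g_min_ns=12.0):
    """P_bias = |G_meas| * V_bias^2, mapping the simulated weight back via Eq. (1)."""
    g_meas_ns = G * m_ns + g_min_ns       # conductance in nS
    return np.abs(g_meas_ns) * v_bias**2  # nS * V^2 = nW

print(pulse_energy_pj(20.0))  # ~2.9e3 pJ, i.e., ~2.9 nJ for the largest update |dF| = 20
print(bias_power_nw(0.0))     # 4.3 nW remnant power draw at G = 0, V_bias = 0.6 V
```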

In Fig. 4g the estimated energy consumed during inference over the course of a Pong game by either a memristor (blue) or a pure GPU implementation (orange) of synapse \({S}_{max\Delta F}\) is provided. In the memristor case, the energy consumption can be decomposed into two contributions, the short-term weight updates (ΔF) and the applied bias voltage needed to control the decay time constant (Decay). These two components cover the short-term synaptic plasticity and meta-plasticity required by an ST-Hebb synapse during inference. The standard input-weight multiplication is obtained through Ohm’s law I = G ⋅ Vread, where Vread encodes the input. The power consumed by this operation is, however, already accounted for by Pbias: the current resulting from the application of the maximum bias voltage \(\max \{{V}_{bias}\}=0.6\,V\) can be read out to compute the input-weight multiplication. To implement the same plasticity, meta-plasticity, and input-weight multiplication on a GPU, the following four operations need to be executed at every time step during the game (6826 in total): (1) element-wise addition of the short- and long-term weight components, (2) element-wise multiplication of F with Λ for the short-term decay, (3) element-wise addition of F and ΔF for the short-term weight update, and (4) vector-matrix multiplication of inputs and weights (weight mult.); a sketch of these four operations is given below. For each of these operations the GPU’s energy consumption was measured for a single synapse (see Methods section “GPU energy measurement”). It is found that the energy consumption of the memristor increases more slowly with the number of timesteps than the GPU baseline. It should however be noted that even though our multi-functional memristive synapse can fully mimic the behavior of an ST-Hebb synapse, the operations of the neuron still need to be performed on a GPU: this concerns the calculation of the magnitude of ΔF via the first term in Methods Eq. (4), the calculation of the non-linear activation function in Methods Eq. (3), and the normalization of the pre-synaptic input (Supplementary Fig. S11b).
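Written out in PyTorch, the four synaptic operations of one time step look roughly as follows (a sketch with the m-STPN layer's shapes, not the exact benchmark code):

```python
import torch

n_in, n_out = 2592 + 64, 64  # m-STPN layer: 2656 inputs (features + recurrence), 64 units
dev = "cuda"
W   = torch.randn(n_out, n_in, device=dev)  # long-term weights
F   = torch.zeros(n_out, n_in, device=dev)  # short-term weights
Lam = torch.rand(n_out, n_in, device=dev)   # decay constants
dF  = torch.randn(n_out, n_in, device=dev)  # short-term updates for this step
x   = torch.randn(n_in, device=dev)         # input vector

G = W + F      # (1) element-wise addition of weight components
F = Lam * F    # (2) element-wise short-term decay
F = F + dF     # (3) element-wise short-term weight update
y = G @ x      # (4) vector-matrix input-weight multiplication
```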

To estimate the total synaptic energy consumption of the whole network the contribution of each synapse for an entire game of Pong has to be considered (Fig. 4h). Both the energy consumed by the ΔF updates (dark blue), and by the control of the decay time constant (light blue) are shown in the form of a histogram. Most synapses do not undergo any short-term weight update during the entire game and therefore do not consume energy for this operation, as indicated by the large ΔF spike centered around 0. For the decay control, we assume the worst-case scenario where a bias voltage of 0.6 V is applied to all synapses. The current due to this bias can be read out, which accounts for the energy consumption due to the calculation of the vector-matrix multiplication between the input and the weights. A crossbar array architecture is assumed for this purpose.

The total energy (i.e., ΔF plus Decay) consumed by each memristive synapse is shown in the histogram of Fig. 4i. By summing up the contributions from all synapses we obtain a total energy consumption of 36 mJ (Memristor row in Table 1). This value takes into account the four synaptic operations (ΔF, Decay, W+F, and weight multiplication) of all memristive synapses of the entire STPN network for a whole Pong game. To give a nuanced comparison with a pure GPU implementation, we provide two separate measurements using an NVIDIA A100 40GB device (see Methods section “GPU energy measurement” for details). We report the median of 100 individual runs per synaptic operation for half- and single-precision floating-point arithmetic (fp16 and fp32, respectively).

Table 1 Energy consumed in mJ by the whole STPN network during one game of Atari Pong

First, we measure the GPU’s energy consumption for executing each synaptic operation for all of the network’s 169984 multi-functional synapses. The results are shown in the GPU (standard) row. It is observed that roughly one third of the total energy consumption stems from the three ST-Hebb-specific operations (ΔF, Decay, and W + F) and two thirds from the standard input-weight multiplication. We note that since the GPU is a massively parallel machine, this number of synapses may not fully utilize the device, potentially leading to lower energy efficiency. Indeed, the A100 GPU achieves the highest energy efficiency for a hypothetical network with around \(2^{21}\) synapses. The case labeled GPU (optimal) is the corresponding energy consumption scaled to the original network’s number of synapses. By comparing the fp16 case of the GPU (optimal) energy consumption with the total in the Memristor row, an improvement by a factor of 96 is obtained. The saved power is due to both the multi-functional nature of our memristors and their in-memory compute capabilities, which in combination allow for the simultaneous computation of four operations without any memory traffic. The absence of memory traffic is especially beneficial, because all operations considered (i.e., element-wise and vector-matrix multiplication) have little to no data reuse and are memory-bound. As a consequence, most energy is consumed in data movement rather than computation (von Neumann bottleneck). This is demonstrated in Methods section “GPU energy measurement”, where we quantify the energy consumption of the GPU’s memory traffic: it accounts for more than 98% of the total. We also provide a discussion of the latency and energy delay product (EDP) of our implementation in Supplementary Section S17. Note that the energy consumption in the memristor case was estimated from the behavior of individual devices and not based on a comprehensive circuit simulation encompassing the whole STPN network. Although such investigations would certainly lead to increased energy consumption59, we believe that the memristor advantage is large enough (two orders of magnitude) to persist even under more realistic conditions.

In conclusion, we presented a two-terminal memristor based on STO that is able to store and compute both long- and short-term synaptic weight updates, effectively collocating memory and computation as well as long- and short-term dynamics. In particular, we demonstrated control over the short-term decay time constant without the need for an additional electrical contact or complex control signals, which implements a form of intrinsic meta-plasticity. All these features are essential for neuromorphic circuit implementations, e.g., STPN networks, which outperform traditional artificial neural networks in large-scale, complex machine learning tasks such as Atari Pong. We contributed here to the development of these networks with the introduction of m-STPN units, increasing the reliability during training and highlighting the importance of long decay time constants. Finally, in simulation, we compared our memristor implementation of an STPN network to a GPU one and obtained a significant increase in inference energy efficiency by a factor of at least 96.

To fully realize our simulation concept in hardware, further work is needed: firstly, our STO memristors should be converted to vertical structures, which is expected to reduce device-to-device variability and also allows for the creation of crossbar arrays. In such a vertical, thin-film-based structure the spacing between the electrodes could most likely be decreased significantly compared to our planar devices, which in all likelihood will lead to lower operating voltages. Secondly, the long-term retention of our memristors should be improved, while still preserving their short-term plasticity. It has been suggested that an oxide layer between the Pt electrode and STO could increase the retention of low conductance states51. Moreover, since we observed a significant impact of the decay time constant on the training performance, different decay models should be investigated for both long- and short-term components in STPN networks. The advancement of such biologically inspired neural networks holds the potential to significantly increase the performance of AI applications across diverse dynamic environments. Furthermore, multi-functional memristive synapses with intrinsic dynamics could function as a key enabling technology for the energy-efficient hardware implementation of next-generation neural networks.

Methods

Device fabrication

The STO single crystal substrate was first submerged in a 90 °C DI water bath under UV light illumination for 100 min60. The substrate was then baked at 250 °C for 5 min and subjected to an O2 plasma treatment (200 W) for 3 min. This water-leaching surface treatment is expected to produce an atomically flat, predominantly TiO2-terminated surface, which is characterized by terraces of 1 unit cell (u.c.) height. This was indeed observed at several locations of the substrate, as shown in Supplementary Fig. S1. Both electrode stacks (Cr-Pt and Ti-Pt) were then patterned using e-beam lithography and deposited by e-beam evaporation (Supplementary Figs. S2a and S2b). After deposition the whole device was annealed at 300 °C for 20 min in an Ar atmosphere (Supplementary Fig. S2c). This step causes a thermal oxide to form at the Ti-STO interface, leaving behind oxygen vacancies61. Annealing also likely leads to diffusion of chromium into the STO, doping it in the process62. The device stack was finally encapsulated within 30 nm of SiN using plasma-enhanced chemical vapor deposition (PECVD) to protect against oxidation (Supplementary Fig. S2d). The STO single crystal substrate was characterized by a four-point probe measurement, which resulted in a surface resistance of >10 GΩ, exceeding the measurement limit of the setup. We can therefore safely ignore surface contributions to our device conductance.

Experimental setup

The quasi static I-V characteristics were measured with a Keysight M9601A Source Measure Unit. Voltage pulses were generated with a Keysight 33500 Arbitrary Waveform Generator. The current was fed through a DHPCA-100 trans-impedance amplifier from Femto and read out with a Rohde&Schwarz RTE 1104 oscilloscope.

Modified STPN model

The equations describing the forward pass through an STPN layer follow20:

$${{{\bf{G}}}}^{(t)}={{\bf{W}}}+{{{\bf{F}}}}^{(t)}$$
(2)
$${{{\bf{h}}}}^{(t)}=\tanh ({{{\bf{G}}}}^{(t)}{{{\bf{x}}}}^{(t)})$$
(3)
$${{{\bf{F}}}}^{(t+1)}=\underbrace{{{\mathbf{\Gamma }}}\odot ({{{\bf{x}}}}^{(t)}\otimes {{{\bf{h}}}}^{(t)})}_{\Delta {{\bf{F}}}}+\underbrace{{{\mathbf{\Lambda }}}\odot {{{\bf{F}}}}^{(t)}}_{{{\rm{Decay}}}}$$
(4)

where bold letters denote matrices,  ⊙ element-wise multiplications and  ⊗ outer products. The STPN layer model is parameterized by the long-term weight W, the Hebbian association strength Γ, and the short-term decay parameter Λ. During training these three parameters are learned using backpropagation through time (BPTT). While W directly controls the synaptic strength, the Λ and Γ parameters define how the synaptic weight responds to stimuli, effectively implementing a form of meta-plasticity or learning-to-learn. The plastic update of the synapse is modeled by Eq. (4). Equations (3) and (4) are adapted slightly from the original work in ref. 20 to reflect the specific implementation here. In addition to Eqs. (2) to (4) the original STPN model also includes a form of normalization on both the synaptic input \(x\to {x}_{eff}=\frac{x}{| | W+F| | }\) and the plastic weight \(F\to {F}_{eff}=\frac{F}{| | W+F| | }\) (Supplementary Fig. S11a). This speeds up stochastic gradient descent during training. The normalization of F leads to a modification of Eq. (4) in which the decay parameter Λ becomes \({\Lambda }_{eff}=\frac{\Lambda }{| | W+{F}^{(t)}| | }\). As a consequence, the decay time constant changes at every time step, because F(t) varies over time. Such variations can lead to instabilities during training and cannot be straightforwardly implemented on our memristors. Another consequence is that the decay time constant Λ cannot be constrained a priori to a certain range, because Λeff depends on the values of W and F, which are unknown at the start of training. However, clamping Λ is important, as training becomes highly unstable if synapses reach values Λeff > 1 (see Supplementary Fig. S15). The solution adopted to circumvent this issue in the original formulation of ref. 20 consisted of starting with small values of Λ at the beginning of the training to ensure that Λeff does not exceed 1. This has the disadvantage that the network only slowly learns longer decay time constants. By removing the normalization of the plastic weight F and only normalizing the input (Supplementary Fig. S11b) in our modified STPN unit, we achieve better performance during training and also make the implementation on memristors feasible.
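A minimal PyTorch sketch of the resulting m-STPN forward step (Eqs. (2) to (4) with input-only normalization) is given below; the function name is ours, and we assume a Frobenius norm for ∣∣W + F∣∣, an implementation detail not fixed by the text.

```python
import torch

def m_stpn_step(x, F, W, Gamma, Lam):
    """One m-STPN time step: Eqs. (2)-(4) with normalization of the input only."""
    G = W + F                            # Eq. (2): total synaptic weight
    x_eff = x / torch.linalg.norm(G)     # input normalization (F itself is NOT normalized)
    h = torch.tanh(G @ x_eff)            # Eq. (3): neuron output
    dF = Gamma * torch.outer(h, x_eff)   # Hebbian term of Eq. (4) (outer product)
    F_next = dF + Lam * F                # Eq. (4): update plus plain decay
    return h, F_next
```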

Network training

We closely follow the training protocol established in ref. 20. Concretely, we use RLlib63 to train and evaluate agents in PongNoFrameskip-v4. During training the network repeatedly plays against the computer opponent of the gymnasium software library (a common Python implementation of Atari game environments) on the standard difficulty setting (0 out of 3). Preprocessing (dimensionality and color scale) of the game frames is done as in ref. 64, with the exception of frame stacking, which was omitted. The training parameters were also adopted from ref. 20: rollout length (50), gradient clipping (40), discount factor (0.99) and a learning rate starting at 0.0001 with a linear decay schedule finishing at \(10^{-11}\) at 200 million iterations. Models are trained from the experience collected by 4 parallel agents.
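In RLlib terms, these hyperparameters correspond roughly to a configuration like the sketch below; the key names follow common RLlib conventions and are not taken verbatim from the scripts actually used.

```python
# Hedged sketch of the training configuration; key names are assumptions
# based on typical RLlib versions, not the authors' exact code.
config = {
    "env": "PongNoFrameskip-v4",
    "num_workers": 4,                       # 4 parallel experience-collecting agents
    "gamma": 0.99,                          # discount factor
    "grad_clip": 40,                        # gradient clipping
    "rollout_fragment_length": 50,          # rollout length
    "lr": 1e-4,                             # initial learning rate
    "lr_schedule": [[0, 1e-4], [200_000_000, 1e-11]],  # linear decay to 1e-11
}
```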

GPU Energy measurement

To fairly compare the efficiency of a memristor and a GPU implementation of the network in Fig. 4a, it is essential that the GPU’s energy consumption is only measured for the specific arithmetic operations that can be performed on the memristor: (1) W+F, (2) Decay, (3) ΔF and (4) weight multiplication (for more details see Supplementary Section S15). On the GPU, these kernels take the form of (matrix) additions and multiplications that can be performed optimally on such hardware. A dedicated Python script runs each kernel separately. To measure the GPU’s energy consumption, we use the pyJoules library65, a Python wrapper for NVIDIA’s own energy reporting framework, the NVIDIA Management Library (nvml). Since all operations have very short runtimes, we measure the energy spent for 10,000 to 200,000 executions of the corresponding kernels to improve accuracy. We report the median of 100 multi-executions and estimate the 99% confidence interval (CI) using bootstrapping with 1000 samples. We report the GPU energy consumption of operations (1) to (4) in three ways: (I) per full Atari Pong game of the whole neural network, which employs 64 ∗ 2656 = 169984 synapses and runs for 6826 time steps (Table 1 in the main text), (II) per operation, i.e., for a single synapse and one time step (Table 2 in this section), and (III) for a single synapse over the course of a full Atari Pong game, i.e., 6826 time steps (Fig. 4g).
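A minimal sketch of such a measurement is shown below; it assumes a typical pyJoules setup (EnergyContext with an NVIDIA GPU domain), and the exact wiring of handlers and tags may differ from our benchmark scripts.

```python
import torch
from pyJoules.energy_meter import EnergyContext
from pyJoules.device.nvidia_device import NvidiaGPUDomain
from pyJoules.handler.print_handler import PrintHandler

# Example: measure kernel (1), W + F, repeated many times because a single
# execution is too short to resolve with nvml-based energy counters.
W = torch.randn(64, 2656, device="cuda")
F = torch.randn(64, 2656, device="cuda")

with EnergyContext(handler=PrintHandler(), domains=[NvidiaGPUDomain(0)], start_tag="W_plus_F"):
    for _ in range(10_000):
        G = W + F                 # the kernel under test
    torch.cuda.synchronize()      # wait for all kernels before the meter stops
```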

Table 2 Energy consumed per floating point operation (flop) in pJ

(I) For the GPU (standard) results in Table 1, the matrices W, F(t), x(t), and Λ required by operations (1) to (4) have the same size as in the neural network simulation. For the GPU (optimal) results, we increase the size of the matrices using the formula (2592 + 64) ⋅ k, where k is a power of two and ranges from 64 (original network size) to 4096. We report the energy spent for k = 1024, which exhibits the highest energy efficiency, scaled down to the original network’s size (see Supplementary Fig. S17).

(II) The GPU (standard) and GPU (optimal) rows in Table 2 were obtained by dividing the values of Table 1 by the number of operations executed during the whole game (64 ∗ 2656 ∗ 6826). The energy is given per floating point operation (flop) in pJ. Note that the weight multiplication is computed by a fused multiply-add (FMA) operation, which counts as two flops (one for addition and one for multiplication).

In the GPU (compute) row of Table 2 we implement a CUDA kernel that only operates on data stored in registers, without reading/writing from/to the GPU global memory. These results therefore measure the energy spent for the computation only, without the contribution of memory traffic. Concretely, each kernel execution performs as many arithmetic operations (addition, multiplication or FMA) as needed for one complete Atari Pong game. To increase accuracy, each measurement combines 10,000 kernel executions. As before, we report the median of 100 multi-executions and estimate the 99% CI. The energy consumption per flop for the weight multiplication corresponds to approximately 5.9 and 9.5 pJ/flop for half- and single-precision, respectively. This is in the same ballpark as measurements provided by NVIDIA and independent testing of the GPU’s floating point unit (FPU)66,67, which validates our measurements. By comparing the GPU (compute) results with the GPU (optimal) ones, we observe that the memory traffic accounts for more than 98% of the GPU’s total energy consumption. This result shows the remarkable energy efficiency of the GPU’s FPU and the benefit of reducing memory traffic. Note, however, that this particular GPU implementation would not be useful in practice, because the results of the kernel’s computations are not accessible via the memory and can therefore not be used by a program running on the GPU. For this reason the highest-efficiency GPU benchmark that corresponds to a working implementation is the fp16 energy measurement in the GPU (optimal) row.

(III) For the GPU energy consumption of a single synapse shown in Fig. 4g we made use of the energy measurements per operation in the fp32 case of the GPU (standard) row in Table 2. It should be noted that the energy values in Table 2 were computed by first measuring the energy consumed by all synapses of the network in parallel and then dividing by the number of synapses. This ensures that the massive parallelism of GPUs is utilized, even though we are only interested in the energy consumption of a single synapse. The energy contributions per operation were then cumulatively summed over all time steps to obtain the time-series GPU data in Fig. 4g.

We note that the kernels utilize the GPU’s regular FP cores rather than the tensor cores because the operations (W+F, Decay, ΔF and weight multiplication) do not compute matrix-matrix products.

The specifications of our test system are:

Hardware:

  • GPU: NVIDIA A100 with 40GB memory

  • CPU: 2x AMD EPYC 7742 @ 2.25 GHz (2 x 64/128 Physical/Logical Cores)

  • RAM: 512 GB

Software:

  • Rocky Linux release 8.4

  • Python 3.11.5

  • Pytorch 2.2.0 dev20230913

  • CUDA 12.1.1