1 Introduction

Audition and vision are unique among our senses: they perceive propagating waves. As a result, they bring us detailed information not only about our immediate surroundings but also about the world far beyond. Imagine talking to a friend in a cafe; the door is open, and outside is a bustling city intersection. While touch and smell give a detailed sense of our immediate surroundings, sight and sound tell us we are conversing with a friend, surrounded by other people in the cafe, immersed in a city, its sounds streaming in through the door. Virtual reality ultimately aims to re-create this sense of presence and immersion in a virtual environment, enabling a vast array of applications for society, ranging from entertainment to architecture and social interaction without the constraints of distance.

Rendering. To reproduce the audio-visual experience given in the example above, one requires a dynamic, digital 3D simulation of the world describing how both light and sound would be radiated, propagated, and perceived by an observer immersed in the computed virtual fields of light and sound. The world model usually takes the form of a 3D geometric description composed of triangulated meshes and surface materials. Sources of light and sound are specified with their 3D positions and radiative properties, including their directivity and the energy emitted within the perceivable frequency range. Given this information as input, special algorithms produce dynamic audio-visual signals that are displayed to the user via screens and speaker arrays or stereoscopic head-mounted displays and near-to-ear speakers or headphones. This is the overall process of rendering, whose two components are visualization and auralization (or visual- and audio-rendering).

Rendering has been a central problem in both the graphics and audio communities for decades. While the initial thrust for graphics came from computer-aided design applications, within audio, room acoustic auralization of planned auditoria and concert halls was a central driving force. The technical challenge with rendering is that modeling propagation in complex worlds is immensely compute-intensive. A naïve implementation of classical physical laws governing optics and acoustics is found to be many orders of magnitude slower than required (elaborated in Sect. 3.2.1). Furthermore, the exponential increase in compute power governed by Moore’s law has begun to stall in the last decade due to fundamental physical limits [97]. These two facts together mean that modeling propagation quickly enough for practical use requires research into specialized system architectures and simulation algorithms.

Perception and Interactivity. A common theme in rendering research is that quantitative accuracy as required in engineering applications is not the primary goal. Rather, perception plays the central role: one must find ways to compute those aspects of physical phenomena that inform our sensory system. Consequently, initial graphics research in the 1970s focused on visible-surface determination [54] to convey spatial relations and object silhouettes, while initial room acoustics research focused on reverberation time [60] to convey presence in a room and indicate its size. With that foundation, subsequent research has been devoted toward increasing the amount of detail to reach “perceptually authentic” audio-visual rendering: one that is indistinguishable from an audio-visual capture of a real scene. Research has focused on the coupled problems of increasing our knowledge of psycho-physics, and designing fast techniques that leverage this knowledge to reduce computation while providing the means to test new psycho-physical hypotheses.

The interactivity of virtual reality and games adds an additional dimension of difficulty. In linear media such as movies, the sequence of events is fixed, and computation times of hours or days for pre-rendered digital content can be acceptable, with human assistance provided as necessary. However, interactive applications cannot be pre-rendered in this way, as the user actions are not known in advance. Instead, the computer must perform real-time rendering: as events unfold based on user input, the system must model how the scene would look and sound from moment to moment as the user moves and interacts with the virtual world. It must do so with minimal latency of about 10–100 ms, depending on the application. Audio introduces the additional challenge of a hard real-time deadline. While a visual frame rendered slightly late is not ideal but perhaps acceptable, audio lags may result in silent gaps in the output. Such signal discontinuities annoy the user and break immersion and presence. Therefore, auralization systems in VR tend to prioritize computational efficiency and perceptual plausibility while building toward perceptual authenticity from that starting point.

Goal. The purpose of this chapter is to present the fundamental concepts and design principles of modern real-time auralization systems, with an emphasis on recent developments in virtual reality and gaming applications. We do not aim for an exhaustive treatment of the theory and methods in the field. For such a treatment, we refer the reader to Vorländer’s treatise on the subject [102].

Organization. We begin by outlining the computational challenges and the resulting architectural design choices of real-time auralization systems in Sect. 3.2. This architecture is then formalized via the Bidirectional Impulse Response (BIR), Head-Related Transfer Functions (HRTFs), and rendering equation in Sect. 3.3. In Sect. 3.4, we summarize relevant psycho-acoustic phenomena in complex VR scenes and elaborate on how one must balance a believable rendering with real-time constraints among other system design factors in Sect. 3.5. We then discuss in Sect. 3.6 how the formalism, perception, and design constraints come together into the deterministic-statistical decomposition of the BIR, a powerful idea employed by most auralization systems. Section 3.7 provides a brief overview of the two common approaches to acoustical simulation: geometric and wave-based methods. In Sect. 3.8, we discuss some example systems in use today in more depth, to illustrate how they balance the various constraints informing their design decisions, followed by the conclusion in Sect. 3.9.

2 Architecture of Real-time Auralization Systems

In this section, we discuss the specific physical aspects of sound that make it computationally difficult to model, which motivates a modular, efficient system architecture.

2.1 Computational Cost

To understand the specific modeling concerns of auralization, it helps to contrast it with light simulation in games and VR applications. In particular:

  • Speed: The propagation speed of sound is low enough that we perceive its various transient aspects such as initial reflections and reverberation, which carry distinct perceptible information, while light propagation can be treated as instantaneous;

  • Phase: Everyday sounds are often coherent or harmonic signals whose phase must be treated carefully throughout the auralization pipeline to avoid audible distortions such as signal discontinuities, whereas natural light sources tend to be incoherent;

  • Wavelength: Audible sound wavelengths are comparable to the size of architectural and human features (cm to m) which makes wave diffraction ubiquitous. Unlike visuals, audible sound is not limited by line of sight.

Given the unique characteristics of sound propagation outlined above, auralization must begin with a fundamental treatment of sound as a transient, coherent wave phenomenon, while lighting can assume a much simpler geometric formulation of ray propagation for computing a stochastic, steady-state solution [57]. Auralization must carefully approximate the relevant physical mechanisms underlying the vibration of objects, propagation in air, and scattering by the listener’s body. All these mechanisms require modeling highly oscillatory wave fields that must be sufficiently sampled in space and time, giving rise to the tremendous computational expense of brute-force simulation.

Assume some physical domain of interest with diameter \(\mathcal {D}\), the highest frequency of interest \(\nu _{\textrm{max}}\) and speed of propagation c. The smallest propagating wavelength of interest is \(c/\nu _{\textrm{max}}\). Thus, the total degrees of freedom in the space-time volume of interest are \(N_{dof}=(2\mathcal {D}\nu _{\textrm{max}}/c)^4\). The factor of two is due to the Nyquist limit which enforces two degrees of freedom per oscillation. As an example, for full audible bandwidth simulation of sound propagation up to \(\nu _{\textrm{max}}=20,\!000\) Hz in a scene that is \(\mathcal {D}=100\) m across, with \(c=340\) m/s in air: \(N_{dof}=1.9\times 10^{16}\). For an update interval of 60 ms to meet latency requirements for interactive listener head orientation updates [22], one would thus need a computational rate of over 100 PetaFLOPS. By comparison, a typical game or VR application will allocate a single CPU core for audio with a computational rate in the range of tens of GigaFLOPS, which is too slow by a factor of at least one million. This gap motivates research in the area.
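This back-of-envelope estimate is easy to reproduce; the short sketch below evaluates \(N_{dof}\) and the implied computational rate using the values quoted above, assuming roughly one floating-point operation per degree of freedom per update (an illustrative assumption, not a precise cost model).

```python
# Back-of-envelope cost of brute-force wave simulation, using the values above.
c = 340.0                 # speed of sound in air, m/s
nu_max = 20_000.0         # highest audible frequency, Hz
D = 100.0                 # scene diameter, m
update_interval = 0.060   # interactive update interval, s

n_dof = (2 * D * nu_max / c) ** 4          # space-time degrees of freedom (Nyquist: 2 per oscillation)
flops_needed = n_dof / update_interval     # assuming ~1 FLOP per degree of freedom per update

print(f"degrees of freedom : {n_dof:.1e}")                       # ~1.9e16
print(f"required rate      : {flops_needed / 1e15:.0f} PFLOPS")  # hundreds of PetaFLOPS
print(f"gap vs. ~10 GFLOPS : {flops_needed / 10e9:.1e}x")        # more than a million
```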

2.2 Modular Design

Since pioneering work in the 1990s such as DIVA [86, 96], most real-time auralization systems follow a modular architecture shown in Fig. 3.1. This architecture results in a flexible implementation and significant reduction of computational complexity, without substantially impacting simulation accuracy in cases of practical interest.

Fig. 3.1 Modular architecture of real-time auralization systems. The propagation of sound emitted from each source is simulated within the 3D environment to compute a directional sound field immersing the listener. This field is given to the spatializer component that computes appropriate transducer signals for headphone or speaker playback

Rather than simulating the global scene as a single system which might be prohibitively expensive (see Sect. 3.2.1), the problem is divided into three components in a causal chain without feedback:

  • Production: Sound is first produced at the source due to vibration, which, combined with local self-scattering, results in a direction-dependent radiated source signal;

  • Propagation: The radiated sound diffracts, scatters, and reflects in the scene to result in a direction-dependent sound field at the listener location;

  • Spatialization: The sound field is heard by the listener. The spatialization component computes transducer signals for playback, taking the listener’s head orientation into account. In the case of using headphones, this implies accounting for scattering due to the listener’s head and shoulders, as described by the head-related transfer function (HRTF).

Our focus in this chapter will be on the latter two components; sound production techniques such as physical-modeling synthesis are covered in Chap. 2. Here, we assume a source modeled as a (monophonic) radiated signal combined with a direction-dependent radiation pattern.

This separation of the auralization problem into different components is key for efficient computation. Firstly, the perceptual characteristics of all three components may be studied separately and then approximated with tailored numerical methods. Secondly, since the final rendering is composed of these separate models, they can be flexibly modified at runtime. For instance, a source’s sound and directivity pattern may be updated, or the listener orientation may change, without expensive re-computation of global sound propagation. Section 3.3 will formalize this idea.

Limitations. This architecture is not a good fit for cases with strong near-field interaction. For instance, if the listener’s head is close to a wall, there can be non-negligible multiple scattering, so the feedback between propagation and spatialization cannot be ignored. This can be an important scenario in VR [69]. Similarly, if one plays a trumpet with its bell very close to a surface, the resonant modes and radiated sound will be modified, much like placing a mute, which is a case where there is feedback between all three components outlined above. Thus, numerical simulations for musical acoustics tend to be quite challenging. The interested reader can consult Bilbao’s text on the subject [12] and more recent overview [14]. In the computer graphics community, the work in [104] also shows sound production and propagation modeled directly without the separability assumption, with special emphasis on handling dynamic geometry, for application in computer animation. Such simulations tend to be off-line, but modern graphics cards have become fast enough for approximate modeling of interactive 2D wind instruments in real-time [6].

2.3 Propagation

The propagation component takes the locations of a source and listener in the scene to predict the scene’s acoustic response, modeling salient effects such as diffracted occlusion, initial reflections, and reverberation. Combined with source sounds and radiation patterns, it outputs a directional sound field to the listener. Propagation is usually the most compute-intensive portion of an auralization pipeline, motivating many techniques and systems, which we will discuss in Sects. 3.7 and 3.8. The methods have two assumptions in common.

Linearity. For most auralization applications, it is safe to assume that sound amplitudes remain low enough to obey ideal linear propagation, modeled by the scalar wave equation. As a result, the sound field at the listener location is a linear summation of contributions from all sound sources. There are some cases in games and VR when the assumption of linearity may be violated, for instance with explosions or brass instruments. In most such cases, the non-linear behavior is restricted to the vicinity of the event and may be treated via a first-order perturbative approximation which amounts to linear propagation with a locally varying sound speed [4, 27].

Quasi-static scene configuration. Interactive scenes are dynamic, but most propagation methods assume that the problem may be treated as quasi-static. At some fixed update rate, such as a visual frame, they take a static snapshot of the scene shape as well as the locations of the source and listener within it. Then propagation is modeled assuming a linear, time-invariant system for the duration of the visual frame. The computed response for each sound source is smoothly interpolated over frames to ensure a dynamic rendering free of artifacts to the listener.

Fast-moving sources need to be treated with additional care, as direct interpolation of acoustic responses can become error-prone [80]. An important related aspect is the Doppler shift on the first arrival, a salient, audible effect. It may be approximated in the source model by modifying the radiated signal based on source and listener velocities, or by interpolating the propagation delay of the initial sound. Another case violating the quasi-static assumption is aero-acoustic sound radiated by fast object motion through the air. This can be approximated within the source model with Lighthill’s acoustic analogy [53], with subsequent linear propagation for real-time rendering [30, 31].
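As a rough illustration of the second option, interpolating the propagation delay, the sketch below renders the initial arrival of a moving source by reading the source signal through a time-varying fractional delay; the resampling this implies produces the Doppler shift automatically. The motion model, sample rate, and distance attenuation are arbitrary choices for the example, not taken from any particular system.

```python
import numpy as np

fs = 48_000                    # audio sample rate, Hz (illustrative)
c = 340.0                      # speed of sound, m/s
dur = 2.0
t = np.arange(int(dur * fs)) / fs

# Source signal: a 440 Hz tone standing in for q'(t).
q_src = np.sin(2 * np.pi * 440.0 * t)

# Illustrative motion: the source flies past a listener at the origin at
# 30 m/s with a 5 m lateral offset; positions are evaluated per sample.
src_x = -30.0 + 30.0 * t
src_y = np.full_like(t, 5.0)
distance = np.hypot(src_x, src_y)

# The sample heard at time t was emitted at t - distance(t)/c. Reading the
# source signal at this time-varying fractional position resamples it,
# which is what produces the audible Doppler shift of the initial sound.
read_pos = (t - distance / c) * fs
i0 = np.clip(np.floor(read_pos).astype(int), 0, len(q_src) - 2)
frac = read_pos - i0
valid = read_pos >= 0
direct = np.where(valid, (1 - frac) * q_src[i0] + frac * q_src[i0 + 1], 0.0)
direct /= np.maximum(distance, 1.0)    # simple 1/r distance attenuation

# Upper bound on the frequency shift, reached when approaching head-on.
print("max Doppler factor ~", c / (c - 30.0))
```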

2.4 Spatialization

In a virtual reality scenario, the target of the audio rendering engine is typically a listener located within the virtual scene, experiencing the virtual acoustic environment with both ears. For this experience to feel plausible or natural, sound should be rendered to the user’s ears as if they were actually present in the virtual scene. The architecture in Fig. 3.1 neglects the effect of the listener on global sound propagation. The spatialization system (shown to the right in the figure) inserts the listener virtually into the scene and requires additional processing. A properly spatialized virtual sound source should be perceived by the listener as emanating from a given location. In the simplest case of free-field propagation, a sound source can be positioned virtually by convolving the source signal with a pair of filters known as head-related transfer functions (HRTFs). This results in two ear input signals that can be presented directly to the listener over headphones. For a more complex virtual scene containing multiple sound sources as well as their acoustic interactions with the virtual environment, spatialization entails encoding appropriate localization cues into the sound field at the listener’s ear entrances. Common approaches include spherical-harmonics based rendering (“Ambisonics”) [42, 67] as well as object-based rendering [17].

HRTFs. If the sound is played back to the listener via headphones, this implies simulating the filtering that sound undergoes in a real sound field as it enters the ear entrances, due to reflections and scattering from the listener’s torso, head, and pinnae. A convenient way to describe this filtering behavior is via the HRTFs. The HRTFs are a function of the direction of arrival and contain the localization cues that the human auditory system decodes to determine the direction of an incoming wavefront. HRTFs for a particular listener are usually constructed via measurements in an anechoic chamber [40], though recent efforts exist to derive HRTFs for a listener on the fly without an anechoic chamber [50, 61], by adapting or personalizing existing HRTF databases using anthropometric features [15, 38, 41, 89, 106], or by capturing image or depth data to model the HRTFs numerically [20, 58, 65]. For a review of HRTF personalization techniques, refer to Chap. 4 and see [48]. The HRTFs can be tabulated as two spherical functions \(H^{\{l,r\}}(s,t)\) that encapsulate the angle-dependent acoustic transfer in the free field to the left and right ears. The set of incident angles s contained in the HRTF dataset is typically dictated by the HRTF measurement setup [5, 39]. The process of applying HRTFs to a virtual source signal to encode localization cues is referred to as binaural spatialization.
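As a minimal sketch of binaural spatialization in the free-field case, the function below picks the HRTF pair measured closest to the desired direction of arrival and convolves it with the source signal. The array layout of the HRTF dataset is a placeholder assumption; real systems load a measured set and typically also interpolate between neighboring measurements rather than using the nearest one.

```python
import numpy as np
from scipy.signal import fftconvolve

def spatialize_free_field(src, direction, hrtf_dirs, hrir_l, hrir_r):
    """Nearest-neighbor HRTF spatialization of a mono source signal.

    src       : (N,) mono source signal
    direction : (3,) unit vector of arrival in head coordinates
    hrtf_dirs : (K, 3) unit vectors of the measured HRTF directions
    hrir_l/_r : (K, L) left/right HRTF impulse responses (HRIRs)
    """
    direction = np.asarray(direction, float)
    direction /= np.linalg.norm(direction)
    k = int(np.argmax(hrtf_dirs @ direction))    # closest measured direction
    left = fftconvolve(src, hrir_l[k])
    right = fftconvolve(src, hrir_r[k])
    return np.stack([left, right])               # (2, N + L - 1) binaural signal

# Toy usage with random placeholder HRIRs; a real dataset would be loaded instead.
rng = np.random.default_rng(0)
K, L = 512, 256
dirs = rng.normal(size=(K, 3))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
hl = 0.01 * rng.normal(size=(K, L))
hr = 0.01 * rng.normal(size=(K, L))
ears = spatialize_free_field(rng.normal(size=4800), [1.0, 0.2, 0.0], dirs, hl, hr)
```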

Spatialization for loudspeaker arrays is also possible, commonly performed using channel-based methods such as Vector Base Amplitude Panning [72] or Ambisonics [42]. It is also possible to physically reproduce the virtual directional sound field using Wave Field Synthesis [2] with large loudspeaker arrays. For the rest of this chapter, we will focus on binaural spatialization, although most of the discussion can be easily adapted to loudspeaker reproduction as discussed in Chap. 5.

Spherical-harmonics based rendering. Various methods exist to spatialize acoustic scenes. A convenient description of directional fields is via spherical harmonics (SHs) or Ambisonics [43]. Given a SH representation of a scene, binaural ear input signals can be obtained directly via filtering with a SH representation of the listener’s HRTFs [29]. However, encoding complex acoustic scenes to SHs of sufficiently high order while minimizing audible artifacts can be challenging [10, 11, 19, 51]. The openly available Resonance Audio [47] system follows this approach.

Object-based rendering. In this chapter, we will follow the direct parameterization over time and angle of arrival, which is also common in practice, as in the illustrative auralization system we discuss in Sect. 3.8.4. The system directly outputs signals and directions, suitable for spatialization by applying appropriate HRTF pairs. Describing the acoustic propagation problem from a source to the listener in terms of a directional sound field, as presented in Sect. 3.3.4, results in a convenient interface between the propagation model and the spatialization engine.

This provides three major advantages. Firstly, it enables a modular system design that treats propagation modeling and (real-time) spatialization as separate problems that are solved by independent sub-systems. This separation in turn allows improving and optimizing the sub-systems individually and can lead to significant computational cost savings. Secondly, a description of a sound field enveloping the listener in terms of time and angle of arrival is equivalent to an object-based representation, which is a well-established input format for existing spatialization software, thus allowing the system designer to build easily on existing spatialization systems. Finally, psycho-acoustic findings on the perceptual limits of human spatial hearing, such as just-noticeable differences, are expressed as functions of time and angle of arrival (Sect. 3.4). Knowledge of these perceptual limits can be exploited for further computational savings.

3 Mathematical Model

Auralization may be formalized as a linear, time-invariant process as follows. Assume a quasi-static state of the world at the current visual frame. To auralize a sound source, consider its current pose (position and orientation) to determine its directional sound radiation and then model propagation and spatialization as a feed-forward chain of linear filters. Those filters in turn depend on the current world shape and listener pose, respectively.

Notation. For the remainder of this chapter, for any quantity \((\star )\) referring to the listener, we use prime \((\star ')\) to denote a corresponding quantity referring to the source. In particular, x is listener location and \(x'\) source location. Temporal convolution is denoted by \(*\).

3.1 The Green’s Function

With the linearity and time-invariance assumptions, along with the absence of mean flow or wind, the Navier-Stokes equations simplify to the scalar wave equation that models propagating longitudinal pressure deviations from quiescent atmospheric pressure  [70]:

$$\begin{aligned} \left[ ({1}\big /{c^2})\,\partial _t^2-\nabla _x^2\right] p\left( t,x,x'\right) =\delta \left( t\right) \delta \left( x-x'\right) , \end{aligned}$$
(3.1)

where \(c=340\) m/s is the speed of sound and \(\nabla _x^2\) is the 3D Laplacian operator with respect to x. The equation is solved on a 3D domain given by the scene’s shape, with appropriate boundary conditions to model the frequency-dependent absorptivity of physical materials.

Sound propagation is induced by a pulsed excitation at time \(t=0\) and source location \(x'\) with \(\delta (\cdot )\) denoting the Dirac delta function. The solution \(p(t,x,x')\) is Green’s function that fully describes the scene’s global wave transport, including diffraction and scattering. The principle of acoustic reciprocity ensures that source and listener positions are interchangeable [70]:

$$\begin{aligned} p(t,x,x') = p(t,x',x) . \end{aligned}$$
(3.2)

For treating general scenes, a numerical solver must be employed to discretely sample Green’s function in space and time. Options include accurate wave-based methods that directly solve for the time-evolving field on a grid, and fast geometric methods that employ the high-frequency Eikonal approximation. We will discuss solution methods in Sect. 3.7.

In principle, Green’s function has complete information [3], including directionality, which can be extracted via spatio-temporal convolution of \(p(t,x,x')\) with volumetric source and listener distributions that can model arbitrary radiation patterns [13] and listener directivity [91]. But such an approach is too expensive for real-time evaluation on large scenes, requiring temporal convolution and spatial quadrature over sub-wavelength grids that need to be repeated when either the source or listener moves. Geometric techniques cannot follow such an approach at all, as they do not model wave phase.

This is where modularity (Sect. 3.2.2) becomes indispensable: the source and listener are not directly included within the propagation simulation, but are instead incorporated via tabulated directivity functions that result from their local radiation and scattering characteristics. Below, we formulate the propagation component of this modular approach, beginning with the simplest case of an isotropic source and listener, building up to a fully bidirectional representation that can be combined with arbitrary source and listener directivity during rendering.

3.2 Impulse Response

Consider an isotropic (omni-directional) sound source located at \(x'\) that is emitting a coherent pressure signal \(q'(t)\). The resulting pressure signal at listener location x can be computed using a temporal convolution:

$$\begin{aligned} q(t; x, x^\prime ) = q^\prime (t)~*~p(t; x, x^\prime ). \end{aligned}$$
(3.3)

Here, \(p(t; x, x^\prime )\) is obtained by evaluating Green’s function between the listener and source locations \((x, x^\prime )\). We denote this evaluation by placing them after a semicolon, \(p(t; x, x^\prime )\), to signify that they are held constant, yielding a function of time alone. This function is the (monaural) impulse response capturing the various acoustic path delays and amplitudes from the source to the listener via the scene. The vibrational aspects of how the source event generated the sound \(q'(t)\) are abstracted away: it may be synthesized at runtime, or read out from a pre-recorded file and freely substituted.
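Evaluating (3.3) offline is a single temporal convolution; a minimal sketch follows, with a two-tap toy response standing in for a simulated \(p(t;x,x')\). Real-time engines instead use partitioned (block) convolution or the parametric techniques discussed later, so this is only meant to make the notation concrete.

```python
import numpy as np
from scipy.signal import fftconvolve

fs = 48_000
# Toy impulse response: a direct arrival after 10 ms and a weaker reflection
# after 25 ms, standing in for a simulated p(t; x, x').
p = np.zeros(fs // 10)
p[int(0.010 * fs)] = 1.0
p[int(0.025 * fs)] = 0.4

q_src = np.random.default_rng(1).normal(size=fs)   # 1 s of noise as q'(t)
q_out = fftconvolve(q_src, p)                      # q(t) = q'(t) * p(t; x, x')
```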

3.3 Directional Impulse Response

The directional impulse response \(d(t,s;x,x')\) [32] generalizes the impulse response \(p(t;x,x')\) to include direction of arrival, s. Intuitively, it is the signal obtained by the listener if they were to point an ideal directional microphone in direction s when the source at \(x'\) emits an isotropic impulse.

Given a directional impulse response, spatialization for the listener can be performed to reproduce the directional listening experience via

$$\begin{aligned} q^{\{l,r\}}(t; x,x') = q'(t)*\int _{\mathcal {S}^2} d\left( t,s;x,x'\right) \,*\,H^{\{l,r\}}\left( \mathcal {R}^{-1}(s),t\right) \, ds~, \end{aligned}$$
(3.4)

where \(H^{\{l,r\}}(s,t)\) are the left and right HRTFs of the listener as discussed in Sect. 3.2.4, \(\mathcal {R}\) is a rotation matrix mapping from the head to the world coordinate system, and \(s \in \mathcal {S}^2\) represents the space of incident spherical directions forming the integration domain. Note the advantage of separating propagation (directional impulse response) from spatialization (HRTF application). The expensive simulation necessary for solving (3.1) can ignore the listener’s body entirely, which is inserted later, taking its dynamic rotation \(\mathcal {R}\) into account, via separately tabulated HRTFs as in (3.4).
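In practice the spherical integral in (3.4) is evaluated on a discrete set of directions; the sketch below assumes the directional impulse response has already been reduced to one impulse response per direction bin, with the HRIRs looked up at the corresponding head-relative directions. The array layout and quadrature weights are assumptions of the example, and the head-rotation lookup is omitted.

```python
import numpy as np
from scipy.signal import fftconvolve

def spatialize_directional_ir(q_src, d_dir, hrir_l, hrir_r, weights):
    """Discrete version of Eq. (3.4).

    q_src    : (N,) mono source signal q'(t)
    d_dir    : (K, M) directional impulse response sampled on K direction bins
    hrir_l/r : (K, L) HRIRs already looked up at the head-relative directions
               R^{-1}(s_k) of the same K bins
    weights  : (K,) quadrature weights of the direction grid
    """
    left = sum(w * fftconvolve(d_k, h_k)
               for w, d_k, h_k in zip(weights, d_dir, hrir_l))
    right = sum(w * fftconvolve(d_k, h_k)
                for w, d_k, h_k in zip(weights, d_dir, hrir_r))
    # Convolution is linear, so q'(t) can be applied once to the summed result.
    return fftconvolve(q_src, left), fftconvolve(q_src, right)
```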

3.4 Bidirectional Impulse Response (BIR) and Rendering Equation

The above still leaves out direction-dependent radiation at the source. A complete description of auralization for localized sound sources can be achieved by the natural extension to the bidirectional impulse response (BIR) [26], an 11-dimensional function of the wave field, \(D(t,s,s';x,x')\), illustrated in Fig. 3.2. Analogous to the HRTF, the source’s radiation pattern is tabulated in a source directivity function (SDF), \(S(s',t)\), such that its radiated signal in any direction \(s'\) is given by \(q'(t)*S(s',t)\).

Fig. 3.2 Bidirectional impulse response (BIR). An impulse radiates from source position \(x'\), propagates through a scene, and arrives via two paths in this simple case at listener position x. The paths radiate in directions \(s_1'\) and \(s_2'\) and arrive from directions \(s_1\) and \(s_2\), respectively, with delays based on the respective path lengths. The bidirectional impulse response (BIR) denoted by \(D(t,s,s';x,x')\) contains this time-dependent directional information. Evaluating for specific radiant and incoming directions isolates arrivals, as shown on the right (figure adapted from [26])

We can now write the (binaural) rendering equation:

$$\begin{aligned} q^{\{l,r\}}(t;x,x')&= q'(t)~*\nonumber \\&\iint \! D\left( t,s,s';x,x^{\prime }\right) *S\left( \mathcal {R}'^{-1}(s'),t\right) *H^{\{l,r\}}\left( \mathcal {R}^{-1}(s),t\right) \,ds\,ds'\!, \end{aligned}$$
(3.5)

where \(\mathcal {R}\) is a rotation matrix mapping from the listener’s head to the world coordinate system, \(\mathcal {R}'\) maps from the source’s local frame to the world coordinate system, and the double integral varies over the space of both incident and emitted directions \(s, s' \in \mathcal {S}^2\). A similar formulation can be obtained for speaker-based rendering by using, for instance, VBAP speaker panning weights [72] instead of HRTFs.

The BIR is convolved with the source’s and listener’s free-field directional responses \(S\) and \(H^{\{l,r\}}\), respectively, while accounting for their rotation since \((s,s')\) are in world coordinates, to capture modification due to directional radiation and reception. The integral repeats this for all combinations of \((s,s')\), yielding the net binaural response. This is finally convolved with the emitted signal \(q'(t)\) to obtain a binaural output that should be delivered to the entrances of the listener’s ear canals. Finally, if multiple sound sources are present, this process is repeated for each source and the results are summed.

Bidirectional decomposition and reciprocity. The bidirectional impulse response generalizes the more restrictive notions of impulse response in (3.4) and (3.3), illustrated in Fig. 3.2. The directional impulse response can be obtained by integrating over all radiating directions \(s'\) and yields directional effects to the listener for an omnidirectional source:

$$\begin{aligned} d(t,s;x,x') \equiv \int _{\mathcal {S}^2} D(t,s,s';x,x')\, ds' . \end{aligned}$$
(3.6)

Similarly, a subsequent integration over the directions of arrival at the listener, s, yields back the monaural impulse response, \(p(t;x,x')\).

The BIR admits direct geometric interpretation. With source and listener located at \((x',x)\), respectively, consider any pair of radiated and arrival directions \((s',s)\). In general, multiple paths connect these pairs, \((x',s')\rightsquigarrow (x,s)\), with corresponding delays and amplitudes, all of which are captured by \(D(t,s,s';x,x')\). Figure 3.2 illustrates a simple case. The BIR is thus a fully reciprocal description of sound propagation within an arbitrary scene. Interchanging source and listener, all propagation paths reverse:

$$\begin{aligned} D(t,s,s';x,x') = D(t,s',s;x',x) . \end{aligned}$$
(3.7)

This reciprocal symmetry mirrors that for the underlying wave field, \(p(t;x,x')=p(t;x',x)\) and requires a full bidirectional description. In particular, the directional impulse response is non-reciprocal.

3.5 Band-limitation and the Diffraction Limit

It is important to remember that the bidirectional impulse response is a mathematically convenient intermediate representation only, and cannot be realized physically. The only physically observed quantity is the final rendered audio, \(q^{\{l,r\}}(t;x,x')\). In particular, the BIR representation allows unlimited resolution in time and direction. The source signal, \(q'(t)\), is temporally band-limited for typical sounds, due to aggressive absorption in solid media and air as frequency increases. Similarly, auditory perception is limited to 20 kHz. Band-limitation holds for directional resolution as well because of the diffraction limit [16] which places a fundamental restriction on the angular resolution achievable with a spatially finite radiator or receiver.

For a propagating wavelength \(\lambda \), the diffraction-limited angular resolution scales as \(\mathcal {D}/\lambda \), where \(\mathcal {D}\) is the diameter of an enclosing sphere, such as around a radiating object, or the listener’s head and shoulders in the case of HRTFs [105]. Therefore, all the convolutions and spherical quadratures in (3.5) may be performed on a discretization with sufficient sub-wavelength resolution at the highest frequency of interest. For efficiency, it is common to perform the temporal convolutions in the frequency domain via the Fast Fourier Transform (FFT). Similarly, spherical harmonics (SH) form an orthonormal linear basis over the sphere and can be used to turn the spherical quadrature of a product of functions into an inner product of SH coefficients. An end-to-end auralization system using this approach was shown in [63].

4 Structure and Perception of the Bidirectional Impulse Response (BIR)

To explain how the theory outlined above can be put into practice, we will first review the physical and perceptual structure of the BIR, followed by a discussion of how auralization systems approximate it in various ways.

Fig. 3.3 Structure of the bidirectional impulse response (figure adapted from [26])

4.1 Physical Structure

The structure of a typical (bidirectional) impulse response may be understood in three phases in time, as illustrated in Fig. 3.3. First, the emitted sound must propagate via the shortest path, potentially diffracting around obstruction edges to reach the listener after some onset delay. This is the initial (or “direct”) sound. The initial sound is followed by early reflections due to scattering and reflection from scene geometry. As sound continues to scatter multiple times from the scene, the temporal arrival density of reflections increases, while the energy of an individual arrival decreases due to absorption at material boundaries and in the air. Over time, with sufficient scattering, the response approaches decaying Gaussian noise, which is referred to as late reverberation. The transition from early reflections to late reverberation is demarcated by the mixing time [1, 98].

As we discuss next, each of these phases has a distinct contribution to the overall spatial perception of a sound. These properties of the human auditory perception play a key role in informing how one might approximate the rendering equation (3.5) within limited computational resources, while still retaining an immersive auditory experience. A more detailed review of perception of room acoustics can be found in [37] and [60]. All observations and terms below can be found in these references, unless otherwise noted.

4.2 Initial (“Direct”) Sound

Our perception strongly relies on the initial sound to localize sound sources, a phenomenon called the precedence effect [62]. Referring to Fig. 3.3, if there is a secondary arrival roughly within 1 ms of the initial sound, we perceive a direction intermediate between the two arrival directions, termed summing localization, representing the temporal resolution of spatial hearing. Beyond this 1 ms time window, our perceptual system exerts a strongly non-linear suppression effect, so that people do not confuse the direction of strong reflections with the true heading of the sound. This suppression is sometimes called the Haas effect: a later arrival may need to be as much as 10 dB louder than the initial sound to affect the perceived direction significantly. Note that this is not to say that the later arrival is not perceived at all, only that its effect is not to substantially change the localized direction.

Consider the case shown in Fig. 3.3, and assume the walls do not substantially transmit sound. The sound shown inside the room would be localized by the listener outside as arriving from the direction of the doorway, rather than the line of sight. Such cues are a natural part of how we navigate to visually occluded events in everyday life. The upshot is that in virtual reality, the initial sound path may be multiply-diffracted and must be modeled with particular care so that the user gets localization cues consistent with the virtual world.

4.3 Early Reflections

Early reflections directly affect the perception of source properties such as loudness, width, and distance while also informing the listener about surrounding scene geometry such as nearby reflectors. A copy of a sound following the initial arrival is perceptually fused with it up until a delay called the echo threshold, beyond which it is perceived as a separate auditory event. The echo threshold varies from about 10 ms for impulsive sounds, through 50 ms for speech, to 80 ms for orchestral music [62, Table 1].

The loudness of early reflections is important in two ways. Firstly, the perception of source distance is known to correlate with the energy ratio between the initial sound and the remaining response (whose energy mostly comes from early reflections), called the direct-to-reverberant ratio (DRR) [92]. This is often also called the “wet ratio” by audio designers. Secondly, how well one can understand and localize sounds depends on the ratio of the energy of the direct sound and early reflections in the first 50 ms to that of the rest of the response, as measured by clarity (\(C_{50}\)).
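Both quantities are simple energy ratios of the impulse response and are easy to estimate once the onset of the direct sound is located; the sketch below uses a short window around the strongest peak as the direct sound. Window lengths and exact conventions vary between standards, so the values here are illustrative.

```python
import numpy as np

def drr_and_c50(p, fs, direct_window_ms=2.5):
    """Estimate the direct-to-reverberant ratio and clarity C50 (both in dB)
    of an impulse response p sampled at rate fs."""
    p = np.asarray(p, float)
    e = p ** 2
    onset = int(np.argmax(np.abs(p)))              # crude onset: strongest peak
    w = int(direct_window_ms * 1e-3 * fs)
    i50 = onset + int(0.050 * fs)

    direct = e[max(onset - w, 0): onset + w].sum()
    reverberant = e[onset + w:].sum()
    drr_db = 10 * np.log10(direct / reverberant)

    early = e[onset: i50].sum()                    # first 50 ms after onset
    late = e[i50:].sum()
    c50_db = 10 * np.log10(early / late)
    return drr_db, c50_db
```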

The directional distribution of reflections conveys important detail about the size and shape of the local environment around the listener and source. The ratio of reflected energy arriving horizontally and perpendicular to the initial sound is called lateral energy fraction and contributes to the perception of spaciousness and affects the apparent source width. Further, in VR, strong individual reflections from surfaces close to the listener provide an important proximity cue [69].

Thus, an auralization system must model strong initial reflections as well as the aggregate energy and directionality of later reflections up to the first 80 ms to ensure important cues about the sound source and environment are conveyed.

4.4 Late Reverberation

The reverberation time, \(T_{60}\), is the time taken by the reverberant energy to decay by 60 dB. Since the reverberation contains numerous, lengthy paths through the scene, it provides a sense of the overall scene, such as its size. The \(T_{60}\) is frequency-dependent; the relative decay rate across various frequencies informs the listener about the acoustic materials in a scene and atmospheric absorption.

The aggregate directional properties of reverberation affect listener envelopment which is the perception of being present in a room and immersed in its reverberant field (see Chap. 11 and Sect. 11.4.3 for further discussions on related topics). In virtual reality, one may often be present outside a room containing sounds and any implausible envelopment becomes especially distracting. For instance, consider the situation in Fig. 3.3—rendering an enveloping room reverberation for the listener will sound wrong, since the expectation would be low envelopment.

5 System Design Considerations for VR Auralization

Many types of real-time auralization systems exist today that approximate the rendering equation (3.5), and in particular the evaluation of the scene’s sound propagation (i.e., the BIR, \(D(t,s,s';x,x')\)), which is typically the most compute-intensive portion. They gain efficiency by making approximations based on the intended application, informed by knowledge of the limits of auditory perception.

5.1 Room Auralization

The roots of auralization research lie in the area of computational modeling of room acoustics, an active area of research with developments dating back at least 50 years [7, 60]. The main objective of these computer models has been to aid in the architectural design of enclosures, such as offices, classrooms, and concert halls. The predictions of these models can then be used by acousticians to propose architectural design changes or acoustic treatments to improve the acoustic properties of a particular room or hall, such as the speech intelligibility in a classroom. This requires models that simulate the room’s first reflections and reverberation with perceptual authenticity. The direct path in such applications can often be computed analytically since the line of sight is rarely blocked. We direct the reader to Gade’s book chapter [37] on the subject of room acoustics for an excellent summary of the requirements, metrics, and methods in the field from the viewpoint of concert hall design.

While initially the computer models could only produce quantitative estimates of room acoustic parameters, with increasing compute power, real-time auralization systems were proposed near the beginning of the millennium [86]. As we will discuss in more detail shortly, geometric methods are standard in the area today because they are especially well-suited for modeling a single enclosure where visual occlusion between sources and listener is not dominant. This holds very well in any hall designed for speech or music. Room auralization is available today in commercial packages such as ODEON [82] and CATT [28].

5.2 VR Auralization

The concerns of real-time VR auralization are quite distinct along a number of dimensions, which result from moving from an individual room to a scene that can span entire city blocks with numerous indoor and outdoor areas. This results in a unique set of considerations that we enumerate below, for two reasons. Firstly, they provide a framing for understanding current research in the area and the trade-offs current systems make, which we will discuss in the following sections. Secondly, we hope that the concise listing of practical problems motivates new research in the area, as no system today can meet all these criteria.

  1. Real time within limited computation. A VR application’s auralization component can usually only use a single or a few CPU cores for audio simulation at runtime, since resources must be shared with simulating other aspects of the world, such as rigid-body collisions, character animation, and AI path planning. In contrast, room acoustic auralization can typically consume a majority of a computer’s resources, including the parallel compute power of modern graphics cards. With power-efficient mobile processors integrated into phones and standalone head-mounted displays, the pressure to minimize computation has only increased.

  2. Scene complexity and non-line-of-sight. Room acoustics theory often starts by assuming a single connected space such as a concert hall that has lines of sight from the stage to all listener locations. This allows for a powerful simplification of the sound field as an analytically computable direct sound combined with a diffuse reverberant field. Modern VR systems for building and game acoustics consider a much broader class of scenes, such as a building floor with many rooms, or a street canyon with buildings that may be entered. These are complex scenes not just in the sense of surface detail but also in that the air volume is topologically complex, with many concavities. As a result, non-line-of-sight cases are common. For instance, hearing sounds in the same room with plausible reverberation can be as important as not hearing sounds inside another room, or hearing sounds from unseen sources diffracted around a corner or door.

  3. Perception. Physical accuracy is important to VR auralization not as a goal in itself but rather in so far as it impacts sensory immersion. This opens opportunities for fast approximations, and deeply informs practical systems that scale their errors based on the acuity of the human auditory system. This observation underlies the deterministic-statistical decomposition discussed in the next section. Further, in many applications such as games, plausibility can be sufficient as a starting point, while for instance in auralizing building acoustics one might need perceptual authenticity.

  4. Dynamic sounds. VR auralization must often support dynamic sound sources that can translate and rotate. The rendering must respond with low latency and without distracting artifacts, even for fast source motion. This adds significant complexity to a minimum-viable practical system. However, in architectural acoustic systems, static sound sources can be a feasible starting point.

  5. Dynamic geometry. In many applications, the scene geometry can be changed interactively. This may be while designing a virtual space, in which case an acoustical system for static scenes may re-compute on the updated geometry; depending on the system this can take seconds to hours. The more challenging case is when the geometry is changing in real time. The change might be “locally dynamic”, such as opening a door or moving an obstruction. Since such changes are localized in an otherwise static scene, many systems are able to model such effects. Lastly, the scene may be “globally dynamic”, where there might be unpredictable global changes, such as when a game player creates a building in Minecraft or Fortnite and expects to hear the audio rendering adapt to it in real time; while this has the most practical utility, it is also the most challenging case.

  6. Robustness. VR requires high robustness given unpredictable user inputs. This means the severity and frequency of large outlying errors may matter more than average error. For instance, as the listener moves quickly through a scene with multiple rooms, the variation in reverberation and diffracted occlusion must reliably stay smooth. This is a tightly restrictive constraint: a technique that has large outlying errors may not be viable in immersive VR regardless of its average error. As an example, an implausible error in calculating occlusion with only 0.1% probability per frame, for an experience running at 30 frames per second, means distracting the user every 33 s on average. This deteriorates to 3.3 s with 10 sound sources, and so on.

  7. Scalability. The system should ideally expose compute-quality trade-offs along two axes. Firstly, VR scenes can contain hundreds to thousands of dynamic sound sources, and it is desirable if the signal processing can scale from high-quality rendering of a few sound sources to lower-quality (but still plausible) rendering of numerous sound sources. Secondly, the acoustical simulation should also allow quality to degrade gracefully as scene size increases: for instance, from high-quality propagation modeling of a conference room up to a rough simulation of a city.

  8. Automation. For VR applications, it is preferable to avoid any per-scene manual work, such as geometric scene simplification. Game scenes in particular can span kilometers, with multiple buildings designed iteratively during the production process. This makes manual simplification a major hurdle for practical usage. The auralization system must ideally ingest complex scenes with millions of polygons directly, performing any necessary simplification while requiring minimal human expertise or input, unlike in room auralization.

  9. Artistic direction. VR often requires the final rendering to be controlled by a sound designer. For instance, the reverberation and diffracted occlusion on important dialogue might be reduced to boost speech intelligibility in a game. Or one might want to re-map the dynamic range of the audio rendering with the limits of the audio reproduction system or user comfort in mind. A viable system must provide methods that allow such design intent to be expressed and influence the auralization process appropriately.

6 Rendering the BIR: the Deterministic-Statistical Decomposition

A powerful technique employed by most real-time auralization systems is to decompose the BIR as a sum of a deterministic and statistical component. This is deeply informed by acoustical perception (Sect. 3.4) and is key to enabling the computational trade-offs VR auralization must contend with, as described in the prior section. The initial sound and strong early reflections, such as sound heard via a portal or echoes heard from nearby large surfaces, are treated deterministically: that is, simulated and rendered in physical detail, and updated in real time based on the dynamic source and listener pose and scene geometry. Weak early reflections and late reverberation are represented only statistically, ignoring the precise details of each of the amplitudes and delays of thousands of arrivals or more, which are perceived in aggregate.

To formalize, the BIR is decomposed as

$$\begin{aligned} D(t,s,s';x,x') = D_d(t,s,s';x,x') + D_s(t,s,s';x,x'). \end{aligned}$$
(3.8)

Referring to Fig. 3.3, the initial sound and early reflection spikes deemed perceptually salient can be included accurately in \(D_d\). The residual is \(D_s\), which is usually modeled as noise characterized by its perceptually relevant statistical properties.

Substituting into the rendering equation (3.5) and observing linearity, we have

$$\begin{aligned} q^{\{l,r\}}(t;x,x') = \sum _{\{d,s\}} q'(t)~*\iint \! D_{\{d,s\}} *S\left( \mathcal {R}'^{-1}(s'),t\right) *H^{\{l,r\}}\left( \mathcal {R}^{-1}(s),t\right) \,ds\,ds'\!, \end{aligned}$$
(3.9)

so that the input mono signal, \(q'(t)\), is split off as input into separate filtering processes for the two components, whose binaural outputs are summed. This is a fairly standard architecture followed by both research and commercial systems, as the two components may be approximated independently with perception and the particular application in mind. For the remainder of this section, we will assume the BIR components have been computed and focus on the signal processing for rendering. The next section will discuss how this decomposition informs the design of acoustic simulation methods.

6.1 Deterministic Component, \(D_d\)

The deterministic component, \(D_d\), is typically represented as a set of \(n_d\) peaks:

$$\begin{aligned} D_d(t,s,s';x,x') \approx \sum _{i=0}^{n_d-1} a_i(t)*\delta (t-\tau _i)~\delta (s'-s'_i)~\delta (s-s_i). \end{aligned}$$
(3.10)

Each term represents an echo of the emitted impulse that arrives at the listener position after a delay of \(\tau _i\) from world direction \(s_i\), having been previously radiated from the source in world direction \(s'_i\) at time \(t=0\). The amplitude filter \(a_i(t)\) captures transport effects along the path from edge diffraction, scattering, and frequency-dependent transmission/reflection from scene geometry. Note that the amplitude filter is causal, i.e., \(a_i(t) = 0\) for \(t<0\), and by convention \(\tau _{i+1}>\tau _i\). The parameter \(n_d\) is key for trading between rendering quality and computational resources. It is usual to at least treat the initial sound path deterministically (i.e., \(n_d\ge 1\)) because of its high importance for localization due to the precedence effect. Audio engines will usually designate this (\(i=0\)) as the “dry” path with separate design controls due to its perceptual importance.

Substituting from (3.10) into Eq. (3.9), we get

$$\begin{aligned} q^{\{l,r\}}_d(t) = \sum _{i=0}^{n_d-1} q'(t)~*~\delta (t-\tau _i) *a_i(t) *S\left( \mathcal {R}'^{-1}(s'_i),t\right) \! *H^{\{l,r\}}\left( \mathcal {R}^{-1}(s_i),t\right) . \end{aligned}$$
(3.11)

Thus, each path’s processing is a linear filter chain whose binaural output is summed to render the deterministic component to the listener. Reading the equation from left to right: for each path, take the monophonic source signal and input it to a delay line. Read the delay line at (fractional) delay \(\tau _i\) and filter the output based on amplitude filter \(a_i\), then filter it based on the source’s radiation pattern. The lookup via \(\mathcal {R}'^{-1}(s'_i)\) signifies that one must rotate the radiant direction of the path from world space to the local coordinate system of the source’s spherical radiation pattern data.

Finally, the last factor makes concrete the modularity shown in Fig. 3.1: the resulting monophonic signal from this prior processing is sent to the spatializer module as arriving from direction \(\mathcal { \mathcal {R} }^{-1}(s_i)\) relative to the listener. One is free to substitute any spatializer to separately trade off quality and speed of spatialization versus other costs and priorities for the system. One could even use multiple spatialization techniques, such as high-quality spatialization for the initial path, and lower fidelity for reflections. In a software implementation, the spatializer often acts as a sink for monophonic signals, processing each, mixing their outputs, and sending them to a low-level audio engine for transmission to transducers, thus performing the summation in (3.11) as well.

Similar to the choice of spatializer, the details of all other filtering operations are highly flexible. For the amplitude filter \(a_i\), the simplest realization is to multiply by a scalar for average magnitude over frequencies, thus representing arrivals with idealized Dirac spikes. But for the initial sound filter \(a_0\), even in a minimalistic setting it is common to apply a low-pass filter to capture the audible muffling of visually occluded sounds. A more accurate implementation accounting for frequency-dependent boundary impedance could use equalization filters in octave bands. For source directivity, it is common to measure and store radiation patterns as third-octave or octave-band data tabulated over the sphere of directions while ignoring phase. Convolution can then be realized via modern fast graphic equalizer algorithms that employ recursive time-domain filters [68].

The commutative and associative properties of convolution are a powerful tool to optimize signal processing. The ordering of filters in (3.11) has been chosen to illustrate this. The delay is applied in the very first operation. This makes it so that we only need one single-write-multiple-read delay line shared across all paths. The signal \(q'(t)\) is written as input, and each path reads out at delay \(\tau _i\). This is a commonly used optimization. Further, one may then use the associative property to group the factors: \(a_i(t)*S\left( \mathcal {R}'^{-1}(s'_i),t\right) \). If both are implemented, say, using an octave-band graphic equalizer, then the per-band amplitudes can be multiplied first and provided to a single instance of the equalizer, a nearly two-fold reduction in equalization compute. These optimizations illustrate the importance of linearity and modularity in the efficient implementation of auralization systems.
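A minimal sketch of these two optimizations follows: a single-write, multiple-read delay line shared by all deterministic paths, and the per-band magnitudes of \(a_i\) and the source directivity multiplied before a single (here trivially stubbed) equalization stage. The per-path data layout, the equalizer stub, and the spatializer stub are assumptions of the example rather than the structure of any particular engine.

```python
import numpy as np

class SharedDelayLine:
    """Single-write, multiple-read circular delay line shared by all paths."""
    def __init__(self, max_samples):
        self.buf = np.zeros(max_samples)     # must exceed max delay + block size
        self.write_pos = 0

    def write_block(self, block):
        idx = (self.write_pos + np.arange(len(block))) % len(self.buf)
        self.buf[idx] = block
        self.write_pos = (self.write_pos + len(block)) % len(self.buf)

    def read_block(self, delay_samples, n):
        # Fractional (possibly non-integer) delayed read via linear interpolation.
        pos = (self.write_pos - n - delay_samples + np.arange(n)) % len(self.buf)
        i0 = np.floor(pos).astype(int) % len(self.buf)
        i1 = (i0 + 1) % len(self.buf)
        frac = pos - np.floor(pos)
        return (1 - frac) * self.buf[i0] + frac * self.buf[i1]

def broadband_eq(x, band_gains):
    # Placeholder equalizer: applies only the mean band gain; a real system
    # would use octave-band recursive filters here.
    return x * float(np.mean(band_gains))

class NullSpatializer:
    def submit(self, signal, direction):
        # Placeholder: a real spatializer would apply the HRTF pair for
        # `direction` and mix the binaural result into the output bus.
        pass

def render_paths_block(q_block, delay_line, paths, equalizer, spatializer):
    """One audio block of the deterministic component, Eq. (3.11)."""
    delay_line.write_block(q_block)                      # single write...
    for path in paths:                                   # ...multiple delayed reads
        dry = delay_line.read_block(path["delay"], len(q_block))
        # Merge the amplitude filter and source directivity into one EQ pass.
        gains = np.asarray(path["gains"]) * np.asarray(path["src_gains"])
        spatializer.submit(equalizer(dry, gains), path["direction"])

# Toy usage: one block of noise through two hypothetical paths.
fs = 48_000
dl = SharedDelayLine(fs)
paths = [
    dict(delay=0.012 * fs, gains=np.full(8, 0.8), src_gains=np.full(8, 1.0),
         direction=(1.0, 0.0, 0.0)),
    dict(delay=0.034 * fs, gains=np.full(8, 0.3), src_gains=np.full(8, 0.7),
         direction=(0.0, 1.0, 0.0)),
]
render_paths_block(np.random.default_rng(2).normal(size=512),
                   dl, paths, broadband_eq, NullSpatializer())
```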

6.2 Statistical Component, \(D_s\)

The central concept for rendering the statistical component, \(D_s\), is to use an analysis-synthesis approach [56]. The analysis phase performs lossy perceptual coding of the statistical component of the BIR, \(D_s\), to compute \(\bar{D}_s\), the energy envelope of the response binned in time, frequency, and direction. We use the over-bar notation \(\bar{f}(\bar{y})\) to indicate that y is sub-sampled, and that f's corresponding energy is summed without loss, via some windowing, at each sample of \(\bar{y}\). For instance, if p(t) is an impulse response, \(\bar{p}(\bar{t})\) indicates the corresponding echogram, which is the histogram of \(p^2(t)\) sampled at some time-bin centers, \(\bar{t}\). This notation is introduced to indicate the reduction in the sampling rate of y, and the loss of fine-structure information in f at its original sampling rate, such as phase.

Parametric reverberation. During real-time rendering, the description captured in \(\bar{D}_s\) can be synthesized using fast parametric reverberation techniques: the “parameters” being statistical properties that determine \(\bar{D}_s\), as we will discuss. The key advantage is that since the fine structure of the response in time, frequency, and direction is left unspecified, one has vast freedom in choosing efficient techniques. These techniques often rely on recursive time-domain filtering, which can make the CPU cost far smaller than applying a filter several seconds long via frequency-domain convolution. The research problem is to make the artificial reverberation sound natural. Among other concerns, the produced reverberation must have realistically high temporal echo density and sound colorless, not introducing perceivable spectral or temporal modulations that cannot be controlled. For further reading, we point readers to the extensive survey in [99]. In the following, we focus on how one might characterize \(\bar{D}_s\).
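As one concrete, minimal example of parametric late reverberation, the sketch below uses a classic Schroeder-style design (parallel feedback comb filters followed by allpass filters) rather than the reverberator of any particular system. The comb feedback gains are set from a desired broadband \(T_{60}\) via \(g = 10^{-3\tau /T_{60}}\) for loop delay \(\tau\); frequency-dependent decay would require per-band gains or damping filters in the feedback loops, omitted here. The delay lengths are conventional example values.

```python
import numpy as np

def feedback_comb(x, delay, g):
    """y[n] = x[n] + g * y[n - delay] (feedback comb filter)."""
    y = np.array(x, dtype=float)
    for n in range(delay, len(y)):
        y[n] += g * y[n - delay]
    return y

def schroeder_allpass(x, delay, g=0.5):
    """Schroeder allpass: y[n] = -g*x[n] + x[n-delay] + g*y[n-delay].
    Increases echo density without coloring the long-term spectrum."""
    y = np.zeros(len(x))
    for n in range(len(x)):
        xd = x[n - delay] if n >= delay else 0.0
        yd = y[n - delay] if n >= delay else 0.0
        y[n] = -g * x[n] + xd + g * yd
    return y

def parametric_reverb(x, fs, t60):
    """Minimal Schroeder-style reverberator driven by a broadband T60 (seconds)."""
    comb_delays = [1557, 1617, 1491, 1422]            # mutually incommensurate lengths
    wet = np.zeros(len(x))
    for d in comb_delays:
        g = 10.0 ** (-3.0 * (d / fs) / t60)           # comb decays by 60 dB after t60
        wet += feedback_comb(x, d, g)
    wet /= len(comb_delays)
    wet = schroeder_allpass(wet, 225)
    wet = schroeder_allpass(wet, 556)
    return wet

fs = 48_000
impulse = np.zeros(fs); impulse[0] = 1.0
late_tail = parametric_reverb(impulse, fs, t60=1.2)   # synthesized late-reverb tail
```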

Energy Decay Relief (EDR). The EDR [56] is a central concept for statistical encoding of acoustical responses. Consider a monaural impulse response, p(t). The EDR, \(\bar{p}(\bar{t},\bar{\omega })\), is computed by performing short-time Fourier analysis on p(t) to compute how its energy spectral density, integrated over perceptual frequency bands with centers \(\bar{\omega }\), varies over time-bin centers \(\bar{t}\). It can be visualized as a spectrogram. Frequency dependence results from the materials of the boundary (e.g., wood tends to be more absorbent at high frequencies compared to concrete) and atmospheric absorption. Frequency band centers are typically spaced by octaves for real-time auralization, and time bins typically have a width of around 10 ms.
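This band energy envelope can be sketched as a short-time Fourier transform followed by octave-band grouping of energy; the band centers and the 10 ms bin width follow the typical values mentioned above, while the exact windowing choices are illustrative.

```python
import numpy as np
from scipy.signal import stft

def band_energy_envelope(p, fs, bin_ms=10.0):
    """Octave-band energy vs. time of an impulse response p, i.e. the
    quantity written above as p-bar(t-bar, omega-bar)."""
    nper = int(bin_ms * 1e-3 * fs)
    f, t_bins, Z = stft(p, fs=fs, nperseg=nper, noverlap=0)
    energy = np.abs(Z) ** 2
    centers = np.array([125, 250, 500, 1000, 2000, 4000, 8000, 16000], float)
    edr = np.zeros((len(centers), len(t_bins)))
    for k, fc in enumerate(centers):
        band = (f >= fc / np.sqrt(2)) & (f < fc * np.sqrt(2))   # octave band around fc
        edr[k] = energy[band].sum(axis=0)
    return centers, t_bins, edr
```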

The reduced sampling rate makes the EDR, \(\bar{p}\), already quite compact compared to p, which is a highly oscillatory noisy signal at audio sample rates. Further, the EDR is smooth in time: it exhibits slow variation during early reflections (especially if the strong peaks have been separated out already into \(D_d\)) followed by monotonic decay during late reverberation. This opens up many avenues for a low-dimensional description with a few parameters. For instance, for a single enclosure, the EDR in each frequency band may be well-approximated by an exponential decay, resulting in a compact description for the late reverberation parameterized by the initial energy, \(\bar{p}_0\), and 60-dB decay time, \(T_{60}\) in each frequency band:

$$\begin{aligned} \bar{p}(\bar{t},\bar{\omega })\approx \bar{p}_0(\bar{\omega }) 10^{-6 \bar{t}/T_{60}(\bar{\omega })}. \end{aligned}$$
(3.12)
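
Given a (backward-integrated) EDR band, the two parameters in (3.12) can be estimated by a straight-line fit to the decay in decibels; the sketch below assumes the late, exponentially decaying portion of the band has already been isolated.

```python
import numpy as np

def fit_decay_params(t_bar, edr_band):
    """Fit (3.12) to one band: 10*log10(p) ~ 10*log10(p0) - 60*t/T60."""
    level_db = 10.0 * np.log10(edr_band + 1e-30)
    slope, intercept = np.polyfit(t_bar, level_db, 1)  # slope in dB per second
    T60 = -60.0 / slope
    p0 = 10.0 ** (intercept / 10.0)
    return p0, T60
```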

Apart from substantial further compression, the great advantage of such a parametric description is that it is easy to interpret, allowing artistic direction. Reverberation plugins will typically provide \(\bar{p}_0\) as a combination of a broadband “wet gain” and a graphic equalizer, as well as the decay times, \(T_{60}(\bar{\omega })\) over frequency bands. For interactive auralization, the artist can exert aesthetic control by the simple means of modifying the reverberation parameters produced from acoustic simulation. For instance, when the player enters a narrow tunnel in VR, footsteps might get a realistic initial power (\(\bar{p}_0\)) to convey the constricted space, yet speech might have the wet gain reduced to increase the clarity (\(C_{50}\)) and improve the intelligibility of dialogue.

Bidirectional EDR. For an enclosure where conditions approach ideal diffuse reverberation, the EDR can be a sufficient description. Parametric reverberators will typically ensure that the same EDR is realized at both the ears but that the fine structure is mutually decorrelated, so that the reverberation is perceived by the listener as outside their head. However, in VR applications it becomes important to model the directionality inherent in reverberation because it can become strongly anisotropic. For instance, a visually occluded sound in another room heard through a door will be temporally diffuse, but directionally localized towards the door.

The concept of EDR can be extended naturally to the bidirectional EDR, \(\bar{D_s}(\bar{t},\bar{\omega },\bar{s},\bar{s'};x,x')\), which adds dependence on direction for both source and listener. It can be constructed and interpreted as follows. Consider a source located at \(x'\) that radiates a Dirac impulse in a beam centered around directional bin center \(\bar{s'}\). After propagating through the scene, it is received by the listener at location x, who beam-forms in the direction \(\bar{s}\) and then computes the EDR on the received time-dependent signal. The bidirectional EDR thus captures the frequency-dependent energy decay for all direction-bin pairs \({\{}\bar{s},\bar{s'}{\}}\).

Invoking the exponential decay model, the bidirectional EDR may be approximated as

$$\begin{aligned} \bar{D_s}(\bar{t},\bar{\omega },\bar{s},\bar{s'};x,x') \approx \bar{p}_0(\bar{\omega },\bar{s},\bar{s'};x,x') 10^{-6 \bar{t} /T_{60}(\bar{\omega },\bar{s},\bar{s'};x,x')}. \end{aligned}$$
(3.13)

Due to the curse of dimensionality, simulating and rendering the bidirectional EDR can get quite costly despite the simplifications. In practice, one must choose the sampling resolution of all the parameters judiciously depending on the application. An extreme case of this is when we sum over the entire range of a parameter, effectively removing it as a dimension.

Let us consider one example that illustrates the kind of trade-offs statistical modeling offers in balancing rendering quality and computational complexity. One may profitably compute the \(T_{60}\) from energy summed over all listener directions s and source directions \(s'\), which amounts to deriving the reverberation time from the monophonic EDR. In that case, one obtains a simplified hybrid approximation:

$$\begin{aligned} \bar{\bar{D_s}}(\bar{t},\bar{\omega },\bar{s},\bar{s'};x,x') \approx \bar{p}_0(\bar{\omega },\bar{s},\bar{s'};x,x') 10^{-6 \bar{t} / T_{60}(\bar{\omega };x,x')}. \end{aligned}$$
(3.14)

The first factor still captures strong anisotropy in reverberant energy, such as reverberation heard by a listener as streaming from a portal, or reverberant power being higher when a human speaker faces a nearby reverberant chamber rather than away from it. In fact, \(\bar{p}_0(\bar{\omega },\bar{s},\bar{s'};x,x')\) can be understood as a multiple-input-multiple-output (MIMO) frequency-dependent transfer matrix for incoherent energy between a source and receiver, with directional channels sampled via \(s'\) and s, respectively. The approximation lies in the second factor: directionally varying decay times for a single sound source are not modeled, an effect that is often quite subtle to perceive.

7 Computing the BIR

Acoustic simulation is the key computationally expensive task in modern auralization systems due to the high complexity of today’s virtual scenes. In particular, at every visual frame, for all source and listener pairs with locations \((x,x')\), the system must compute the BIR \(D(t,s,s';x,x')\), which may then be applied on each source’s audio as discussed in the prior section. There are two distinct ways the problem may be approached: geometric and wave-based methods. In this section, we will discuss the fundamental ideas behind these techniques.

7.1 Geometric Acoustics (GA)

Geometric methods approximate sound propagation via the zero-wavelength (infinite frequency) asymptotic limit of the wave equation (3.1). Borrowing terminology from fluid mechanics, this yields a Lagrangian approach, where packets of energy are tracked explicitly through the scene as they travel along rays and repeatedly scatter into multiple packets in all directions each time they hit the scene boundary. The key strength of geometric methods is speed and flexibility: compared to a full-bandwidth wave simulation, tracing rays can be much cheaper, and it is much easier to incorporate physical phenomena and to construct the BIR by explicitly assembling paths connecting source to listener. Today, these methods are standard in the area of room auralization.

Their key challenges fall into two categories. First, one must efficiently search for paths connecting source to listener through complex scenes. Searching costs computation, and doing too little can under-sample the response, causing audible jumps in the rendering. Second, diffraction at audible wavelengths must be modeled explicitly (since it is absent in the zero-wavelength limit) to ensure plausibility. Both must be incorporated while balancing smooth rendering for moving sources and listeners against the CPU cost of the geometric analysis inherent in path search.

Below, we briefly elaborate on the general design of GA systems and practical implications for VR auralization, and refer the reader to Savioja and Svensson’s excellent survey on the recent developments in GA techniques [87].

Simplified geometry. Due to the zero-wavelength approximation, geometric methods remain sensitive to geometric detail far below audible wavelengths. For instance, if one directly used a visual mesh for GA simulation, a coffee mug could create a strong audible echo whenever the source and listener are connected by a specular reflection path hitting the mug. Such specular glints are observed for light, but not for sound with its much longer wavelengths. It therefore becomes important to build an equivalent, simplified acoustical model of the scene that captures only large facets, combined with coefficients that summarize scattering due to diffraction. For instance, the seating area in a concert hall might be replaced with an enclosing box with an equivalent scattering coefficient. This process requires the user to have a degree of acoustical expertise, and inaccuracies can result without carefully specified geometry and boundary data [21]. For VR auralization, however, automation is highly desirable, with some recent work along these lines [88].

Deterministic-statistical decomposition. Geometric methods directly incorporate the deterministic-statistical decomposition in the simulation process to reduce CPU burden. In particular, the two components \(D_d\) and \(D_s\) are typically computed and rendered separately and then mixed in the final rendering to balance quality and speed.

GA methods perform a deterministic path search only up to a certain number of bounces off the scene boundary, called the reflection order. This is a key parameter for GA systems because it strongly affects both performance and rendering quality; its appropriate value varies by system and application. Typically, the user can specify this parameter, which then implicitly determines the number of deterministic peaks rendered, \(n_d\), in (3.10). To accelerate path search, early methods [86] proposed using the image source method [7], which is well-suited for single enclosures but scales exponentially with reflection order and does not account for edge diffraction.
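
To illustrate the image source idea, the sketch below computes the direct path and the six first-order specular reflections in an axis-aligned shoebox room; the uniform, frequency-independent reflection coefficient is an assumption for brevity, and higher orders would mirror the image sources recursively.

```python
import numpy as np

def first_order_image_sources(src, lst, room_dims, c=343.0, refl=0.8):
    """Direct path plus first-order specular paths in a shoebox [0,Lx]x[0,Ly]x[0,Lz].

    Each wall contributes one image source obtained by mirroring the true source
    across the wall plane; the path delay is |image - listener| / c and the
    amplitude combines the reflection coefficient with 1/r spreading.
    Returns a list of (delay_seconds, amplitude).
    """
    src, lst = np.asarray(src, float), np.asarray(lst, float)
    paths = []

    def add_path(image, gain):
        r = np.linalg.norm(image - lst)
        paths.append((r / c, gain / max(r, 1e-6)))

    add_path(src, 1.0)                        # direct sound
    for axis in range(3):
        for wall in (0.0, room_dims[axis]):   # mirror across each of the 6 walls
            image = src.copy()
            image[axis] = 2.0 * wall - src[axis]
            add_path(image, refl)
    return paths

# Example: 6 m x 4 m x 3 m room.
paths = first_order_image_sources([1.0, 1.5, 1.2], [4.0, 2.0, 1.6], [6.0, 4.0, 3.0])
```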

Following work on beam tracing, [36] showed that in multi-room scenes, precomputing a beam-tree data structure can at once control the exponential scaling and incorporate edge diffraction, which is crucial for plausibility in such densely occluded scenes. The system introduced precomputation as a powerful technique for reducing runtime acoustics computation, which most modern systems now employ at least to some degree.

A key general concept employed in the beam tracing work of [36] is the room-portal decomposition: an indoor scene with many rooms is approximately decomposed into a set of simple convex cells that represent room volumes, connected by flat portals representing doors. This is a frequently used construct in GA systems, as it allows efficient deterministic path search on the discrete graph formed by rooms as nodes and portals as connecting edges. However, room-portal decomposition does not generalize to outdoor or mixed scenes, a key limitation that recent research is addressing to allow fast deterministic search of high-order diffraction paths [34, 88].
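
The discrete search this enables can be illustrated with a breadth-first traversal of the room-portal graph; the data layout and function below are a hypothetical sketch, with each returned portal sequence subsequently refined into a geometric path through the portal openings.

```python
from collections import deque

def portal_paths(room_graph, src_room, lst_room, max_hops=4):
    """Enumerate portal sequences connecting the source room to the listener room.

    room_graph maps each room id to a list of (portal_id, neighbor_room) pairs;
    the search runs on this discrete graph instead of the full 3D scene.
    """
    results, queue = [], deque([(src_room, [])])
    while queue:
        room, portals = queue.popleft()
        if room == lst_room:
            results.append(portals)
            continue
        if len(portals) >= max_hops:
            continue
        for portal_id, neighbor in room_graph.get(room, []):
            if portal_id not in portals:          # do not re-cross a portal
                queue.append((neighbor, portals + [portal_id]))
    return results

# Example: living room -> hallway -> kitchen.
graph = {
    "living":  [("door_a", "hall")],
    "hall":    [("door_a", "living"), ("door_b", "kitchen")],
    "kitchen": [("door_b", "hall")],
}
print(portal_paths(graph, "living", "kitchen"))   # [['door_a', 'door_b']]
```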

Techniques developed for light transport in the computer graphics community are a great fit for computing the statistical component owing to its phase incoherence. Many methods are possible, such as those based on radiosity [8, 93]. Stochastic path tracing is a standard method in both graphics and acoustics communities today, used originally by DIVA [86] and in modern systems like RAVEN [90]. More recent improvements use bidirectional path tracing [24], which directly exploits the bidirectional reciprocity principle (3.7) to accelerate computation.

GA methods cannot construct the fine structure of the reverberant portion of the response, but as we discussed in Sect. 3.6.2, it is often sufficient to build the bidirectional energy decay relief, \(\bar{D_s}(\bar{t},\bar{\omega },\bar{s},\bar{s'};x,x')\), or some lower dimensional approximation ignoring directionality. With path tracing techniques, this is directly accomplished by accumulating into a histogram indexed on all the function parameters: each path represents an energy packet that accumulates into its corresponding histogram bin. The key parameter trading quality against cost is the number of sampled paths, which must be large enough that the energy value in each histogram bin is sufficiently converged.
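
The histogram accumulation itself is simple; the sketch below assumes a path tracer (not shown) has already produced energy packets with their arrival delay, per-band energies, and direction bins at listener and source.

```python
import numpy as np

def accumulate_histogram(packets, n_time_bins, n_bands, n_dir_bins, bin_ms=10.0):
    """Accumulate traced energy packets into a sampled bidirectional EDR.

    Each packet is (delay_s, band_energies, dir_bin_listener, dir_bin_source);
    the histogram is indexed on time bin, frequency band, and both direction bins.
    """
    hist = np.zeros((n_time_bins, n_bands, n_dir_bins, n_dir_bins))
    for delay_s, band_energies, d_lst, d_src in packets:
        t_bin = int(delay_s / (bin_ms * 1e-3))
        if t_bin < n_time_bins:
            hist[t_bin, :, d_lst, d_src] += band_energies
    return hist

# Two packets landing in different time and direction bins.
packets = [(0.012, np.array([1.0, 0.8]), 0, 3),
           (0.145, np.array([0.1, 0.05]), 2, 3)]
hist = accumulate_histogram(packets, n_time_bins=100, n_bands=2, n_dir_bins=6)
```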

With simplified scenes admitting a room-portal decomposition, one can expect robust convergence, or even use approximations that avoid path tracing altogether [94]. But for path tracing in complex VR scenes, the required number of paths for a converged histogram can vary significantly with source and listener locations, \(\{x,x'\}\). For instance, if they are connected only through a few narrow apertures in the scene, it can be hard to find connecting paths despite extensive random search. There is precedent for such issues in computer graphics as well [101], representing a frontier for new research with systematic convergence studies, as initiated in [24].

7.2 Wave Acoustics (WA)

Wave acoustic methods take an Eulerian approach: space-time is discretized onto a fictitious background, such as a uniform grid, and the pressure amplitude in each cell is updated at each time-step. Paths are not constructed explicitly, so as energy scatters in various directions from scene surfaces, the amount of information tracked does not change. Thus, arbitrary combinations of diffraction and scattering are naturally captured by wave methods. By running a volumetric simulation with a source located at \(x'\) for a sufficient duration, a discrete approximation of the Green’s function \(p(t,x;x')\) is directly produced. The BIR \(D(t,s,s';x,x')\) may then be computed via accurate plane-wave decomposition in a volume centered at the source and listener locations [2, 91], or via the much faster approximation using instantaneous flux density [26], first applied to audio coding in [74].
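
The Eulerian update can be illustrated with a tiny finite-difference sketch of the 2D scalar wave equation on a uniform grid; the grid size, pressure-release (p = 0) boundaries, and impulsive excitation are simplifying assumptions, far from a production solver.

```python
import numpy as np

def fdtd_step(p_prev, p_curr, courant2):
    """Leapfrog update: p_next = 2*p_curr - p_prev + (c*dt/dx)^2 * Laplacian(p_curr)."""
    lap = (np.roll(p_curr, 1, 0) + np.roll(p_curr, -1, 0) +
           np.roll(p_curr, 1, 1) + np.roll(p_curr, -1, 1) - 4.0 * p_curr)
    p_next = 2.0 * p_curr - p_prev + courant2 * lap
    p_next[0, :] = p_next[-1, :] = p_next[:, 0] = p_next[:, -1] = 0.0  # p = 0 walls
    return p_next

c, dx = 343.0, 0.05                            # grid spacing limits usable bandwidth
dt = 0.9 * dx / (c * np.sqrt(2.0))             # within the 2D Courant stability limit
n, steps = 128, 400
p_prev, p_curr = np.zeros((n, n)), np.zeros((n, n))
p_curr[n // 2, n // 2] = 1.0                   # impulsive source at x'
response = []
for _ in range(steps):
    p_next = fdtd_step(p_prev, p_curr, (c * dt / dx) ** 2)
    response.append(p_next[n // 4, n // 4])    # sampled approximation of p(t, x; x')
    p_prev, p_curr = p_curr, p_next
```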

Numerical solvers. The main challenge of wave methods is their computational cost. Since wave solvers directly resolve the detailed wave field by discretizing space and time, their cost scales as the fourth power of the maximum simulated frequency and the third power of the scene diameter, due to the Nyquist criteria outlined in Sect. 3.2.1. This made them outright infeasible for most practical uses until the last decade, apart from low-frequency modal simulations up to a few hundred hertz. However, they have seen a resurgence of interest over the last decade, with many kinds of solvers being actively researched today for auralization, such as spectral methods [52, 77], finite difference methods [49, 85], and the finite element method [71, 103]. Alongside the progress in numerical methods, the increased computational power of CPUs and graphics processors, as well as the availability of increased RAM, now allows simulations of practical cases of interest, such as concert halls, up to mid-frequencies (1 kHz and beyond). This is still short of the complete audible bandwidth, and it is common to use approximate extrapolation beyond the band-limit frequency. The compute times, typically a few hours, remain suitable only for off-line computation. The availability of commodity cloud computation has further aided the wider applicability of wave methods despite the cost.
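
The scaling argument can be made concrete with a back-of-the-envelope estimate: grid spacing shrinks in proportion to 1/f (fixed points per wavelength) and the time step shrinks with it (Courant condition), so the number of cell updates grows as diameter cubed times frequency to the fourth power. The figures below are rough assumptions for illustration only.

```python
def wave_sim_cost(diameter_m, max_freq_hz, duration_s, ppw=8, c=343.0):
    """Rough count of cell updates for a volumetric wave simulation."""
    dx = c / (max_freq_hz * ppw)               # spacing for ppw points per wavelength
    n_cells = (diameter_m / dx) ** 3
    dt = dx / (c * 3 ** 0.5)                   # 3D Courant limit
    n_steps = duration_s / dt
    return n_cells * n_steps

# Doubling the simulated bandwidth costs ~16x more cell updates.
print(wave_sim_cost(30.0, 1000.0, 1.0) / wave_sim_cost(30.0, 500.0, 1.0))   # 16.0
```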

Precomputation and static scenes. The idea of precomputation has been central to the increasing application of wave methods in VR auralization. Real-time auralization with wave methods was first shown to be viable for complex scenes in [80]. The method performs multiple simulations off-line and the resulting (monophonic) impulse responses are encoded and stored in a file. At runtime, this file is loaded, and the sampled acoustical data are spatially interpolated for a dynamic source and listener which informs spatialization of the source audio. This overall architecture is followed by most wave-based auralization methods.

The disadvantage of precomputation is that it is limited to static scenes. However, it has the great benefit that the fidelity of acoustical simulation becomes decoupled from runtime CPU usage. One may perform a detailed simulation directly on complex scene geometry ensuring robust results at runtime. These trade-offs are highly analogous to “light baking” which is a common feature of game engines today: expensive global illumination is simulated beforehand on static scenes to ensure fast runtime rendering. Similar to developments in lighting, one can conceivably incorporate local dynamism such as additional occlusion from portals [76] or moving objects [84] in the future.

Parametric encoding. The key research challenge introduced by precomputation is that the BIR field \(D(t,s,s',x,x')\) is 11-dimensional and highly oscillatory. Capturing it in detail can easily take an impractical amount of storage. Spatial audio coding methods such as DirAC [73, 74] demonstrate a path forward, in that they extract and render perceptual properties from directional audio recordings rather than trying to re-create the physical sound field. This in turn is similar in spirit to audio coding methods such as MP3 where precise waveform reconstruction is eschewed in favor of controllable trade-offs between perceived quality and compressed size.

These observations have motivated a new thread of auralization research on wave-based parametric methods [26, 78, 79] that combine precomputed wave acoustics with compact, perceptual coding of the resulting BIR fields. Such methods are practical enough today to be employed in many gaming applications. The deterministic-statistical decomposition plays a crucial role in this encoding stage, as we will elaborate in Sect. 3.8.4 when we discuss [26] in more detail.

Physical encoding. In a parallel thread, there has been work on methods that directly approximate and convolve the complete BIR without involving perceptual coding. The equivalent source method was proposed in [63, 64], at the expense of restricting to scenes consisting of a sparse set of exterior-scattering building facades. More recently, methods for high-quality building auralization have been developed that sample and interpolate BIRs for dynamic rendering [55]. The advantage is that no inherent assumptions are made about perception or the structure of the BIR, but in turn, such systems tend to be more expensive, and current technology is limited to static sound sources.

8 Auralization Systems

In this section, we will discuss a few illustrative example systems in more detail. We emphasize that this should not be interpreted as a representative survey. Instead, our aim is to illustrate how the design of practical systems can vary widely depending on the intended application, chosen algorithms, and in particular how systems choose to prioritize a subset of the design constraints (Sect. 3.5). Most of these systems are available for download and experimentation.

8.1 Room Acoustics for Virtual Environments (RAVEN)

RAVEN [90] is a research system built from the ground up aiming for perceptually authentic and real-time auralization in VR. The computational budget is thus on the high side, such as all the resources of a single computer or a few networked ones. This is in line with the intended application: for an acoustician evaluating a planned design, it is more important to hear a result with reliable predictive value, and the precise amount of computation does not matter as long as the system runs in real time. RAVEN is a great example of the archetypal decisions involved in the end-to-end design of modern real-time geometric systems.

A key assumption in the system is that the scene is a typical building floor. Many design decisions and efficiencies flow naturally from this. Chiefly, one can employ the room-portal decomposition discussed in Sect. 3.7.1. Local scene dynamism is also allowed by the system, such as opening or closing doors, with limited precomputation on the scene geometry. However, like most geometric acoustic systems, the scene geometry has to be manually simplified with acoustical expertise to obtain the simplified cells required by rooms and portals. Flexible signal processing that can include artistic design need not be considered, since the application is physical prediction.

RAVEN models diffraction on both the deterministic and statistical components of the BIR. The former uses the image source method, with reflection orders up to 3 for real-time evaluation. Edge sources are introduced to account for diffraction paths that, e.g., first undergo a bounce from a flat surface and then diffract around a portal edge. Capturing such effects is especially important for smooth results on dynamic source and listener motion, which RAVEN carefully models.

The statistical component uses stochastic ray tracing with improved convergence using the “diffuse rain” technique [90]. To model diffraction for reverberation, a probabilistic scheme is used [95] that deflects rays that pass close enough to scene edges. Since the precise reconstruction of the reverberant characteristics is of central importance in architectural acoustics, RAVEN models the complete bidirectional energy decay relief, as illustrated in [90, Fig. 5.19].

8.2 Wwise Spatial Audio

Audiokinetic’s Wwise [9] is a commonly employed audio engine in video games, alongside many other audio design applications. Wwise provides both geometric acoustical simulation and HRTF spatialization using either object-based or spherical-harmonic processing (Sect. 3.2.4). The system stands in illustrative contrast to RAVEN, showing how different application needs can deeply shape technical choices of auralization systems. A detailed description of ideas and motivation can be found in the series of white papers [23].

Gaming applications require very low CPU utilization (a fraction of a single core) and do not require physical accuracy. But one needs to approximate carefully: the rendering must stay perceptually believable, with, for example, smooth acoustic changes on fast source motion or visual occlusion. Minimizing precomputation is desirable for reducing artist iteration times. Finally, the ability of artists to interpret the acoustic simulation and design the rendered output is paramount.

To meet these goals, Wwise also starts with a deterministic-statistical decomposition. Like most geometric systems, the user must provide a simplified audio geometry for the scene, which is the bulk of the work. Once this is done, the system responds interactively without precomputation. The initial sound is derived based on an explicit path search on simplified geometry at runtime, with reflections modeled via image sources up to some user-controlled reflection order (usually ~3 for efficiency).

Importantly, rather than estimating diffraction losses based on physical approximations such as the Uniform Theory of Diffraction [59], which cost CPU, the system exposes an abstract “diffraction coefficient” that varies smoothly as the sound source and corresponding image sources transition between visual occlusion and visibility. This ameliorates the key perceptual deficit of audible loudness jumps that result when diffraction is ignored. The audio designer can draw a function in the user interface to map the diffraction coefficient to loudness attenuation. This design underlines how practical systems balance CPU cost, plausible rendering, and artistic control. Note that simply reducing accuracy to save CPU is not the path taken: instead, one must carefully understand which physical behaviors must be preserved so as not to violate our (stringent) sensory expectations, such as the fact that in everyday life sound fields rarely show sudden audible variation with small movements.

For modeling the statistical component, the system avoids costly stochastic ray tracing in favor of reverberation flow modeled on a room-portal decomposition of the simplified scene. The design is in the vein of [94], with diffuse energy flow on a graph composed of rooms as nodes and portals as edges. However, in keeping with the primary goal of audio design, the user is free to choose or parametrically design individual filters for each room, while the system ensures that the net result correctly accumulates reverberation and spatializes it as streaming to the listener from (potentially) multiple portals. Again, plausibility, performance, and design are prioritized over adherence to accuracy, keeping in mind the primary use case of scalable rendering for games and VR.

8.3 Steam Audio and Resonance Audio

Steam Audio [100] and Resonance Audio [46] are geometric acoustics systems also designed for gaming and VR applications with similar considerations as Wwise Spatial Audio. They both offer HRTF spatialization combined with geometric acoustics modeling; however, diffraction is ignored. A distinctive aspect of Steam Audio is the capability to precompute room reverberation filters (i.e., the statistical component) directly from scene geometry without requiring any simplification, auralized dynamically based on listener location. Resonance Audio on the other hand primarily focuses on highly efficient spatialization [47] that scales down to mobile devices for numerous sources, using up to third-order spherical harmonics. In fact, Resonance Audio can be used as a plugin within the Wwise audio engine to perform spatialization, illustrating the utility of the modular design of auralization systems (Sect. 3.2).

8.4 Project Acoustics (PA)

We now consider a wave-based system, Project Acoustics [66], which has shown practical viability for gaming [81] and VR [45] experiences recently. We summarize its key design ideas here; technical details can be found in  [26, 78, 79]. As is typical for wave acoustics systems (Sect. 3.7.2), costly simulation is performed in a precomputation stage, shown on the left of Fig. 3.4. Many simulations are performed in parallel that collectively sample and compress the entire BIR field \(D(t,s,s',x,x')\) into an acoustic dataset. With today’s commodity cloud computing resources, complete game scenes may be processed in less than an hour.

Fig. 3.4 High-level architecture of Project Acoustics’ wave-based parametric auralization

The bidirectional reciprocity principle (3.7) plays an important role. The listener location, x, is typically restricted to head height above walkable surfaces, such as the floors of a building, thus varying in two dimensions rather than three. Potential listener locations are sampled in this lowered dimension, adapting to local geometry [25]. Note that source locations, \(x'\), may still vary in three dimensions. Then, a series of 3D wave simulations is performed with each potential listener location acting as the source during simulation. The reduction of the BIR field’s dimension by one yields an order-of-magnitude reduction in data size.

Project Acoustics’ main idea is to employ lossy perceptual encoding on the BIR field to bring it within practical storage budgets of a few hundred MB. The deterministic-statistical decomposition is employed at this stage. The initial arrival time and direction are encoded explicitly to ensure the correct localization of the sound, and the rest of the response is encoded statistically (i.e., \(n_d=1\) referring to Sect. 3.6.1). An example simulation snapshot is shown in Fig. 3.4 with the corresponding initial path encoding visualized on the right. Color shows frequency-averaged loudness, and arrows show the localized direction at the listener location, x, with the source location \(x'\) varying over the image. For instance, any source inside the room would be localized by the listener as arriving from the door, so the arrows inside the room consistently point in the door-to-listener direction. The perceptual parameters vary smoothly over space, mirroring our everyday experience, allowing further compression via entropy coding [78].
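
A highly simplified sketch of extracting initial-arrival parameters from a (monaural) impulse response is shown below; the onset threshold and window length are illustrative assumptions, and estimating the arrival direction would additionally require directional or flux data, which is not shown.

```python
import numpy as np

def encode_initial_arrival(p, fs, onset_db=-40.0, window_ms=1.0):
    """Onset time and initial loudness of an impulse response.

    Onset: first sample within onset_db of the global peak. Initial loudness:
    energy in a short window after onset, in dB.
    """
    threshold = np.max(np.abs(p)) * 10.0 ** (onset_db / 20.0)
    onset = int(np.argmax(np.abs(p) >= threshold))
    w = max(1, int(window_ms * 1e-3 * fs))
    energy = np.sum(p[onset:onset + w] ** 2)
    return onset / fs, 10.0 * np.log10(energy + 1e-30)
```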

The statistical component simplifies (3.14) further to average over all simulated frequencies, approximating the bidirectional energy decay relief as

$$\begin{aligned} \bar{\bar{D_s}}(\bar{t},\bar{\omega },\bar{s},\bar{s'};x,x') \approx \bar{p}_0(\bar{s},\bar{s'};x,x') 10^{-6\bar{t}/T_{60}(x,x')}. \end{aligned}$$

The directions \({\{}\bar{s},\bar{s'}{\}}\) sample the six signed Cartesian directions, thus discretizing \(\bar{p}_0\) to a \(6\times 6\) “reflections transfer” matrix that compactly approximates directional reverberation, alongside a single \(T_{60}\) value across direction and frequency. Visualizations of the reflections transfer matrix can be found in [26] that illustrate how it captures anisotropic effects like directional reverberation from portals or nearby reverberant chambers.
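
The direction binning and the resulting matrix can be sketched as follows; the bin ordering (+x, -x, +y, -y, +z, -z) is an assumption for illustration, not necessarily the convention used in [26].

```python
import numpy as np

def axis_bin(direction):
    """Map a direction vector to one of six signed Cartesian bins 0..5 (+x,-x,+y,-y,+z,-z)."""
    d = np.asarray(direction, float)
    axis = int(np.argmax(np.abs(d)))
    return 2 * axis + (0 if d[axis] >= 0 else 1)

def reflections_transfer(packets):
    """Accumulate reverberant energy into a 6x6 matrix p0[listener_bin, source_bin].

    Each packet is (energy, arrival direction at listener, departure direction at source).
    """
    p0 = np.zeros((6, 6))
    for energy, dir_listener, dir_source in packets:
        p0[axis_bin(dir_listener), axis_bin(dir_source)] += energy
    return p0

# Most reverberant energy arrives at the listener from +x (e.g., a doorway).
packets = [(0.8, [1.0, 0.1, 0.0], [0.0, 0.0, 1.0]),
           (0.1, [0.0, 1.0, 0.0], [0.0, 0.0, 1.0])]
print(reflections_transfer(packets))
```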

One can observe that this encoding is quite simplified and can only be expected to plausibly reproduce the simulated BIR field. The choices result from the system’s goal: capturing key geometry-dependent audio cues within a compact storage budget, since too large a size simply obviates practical use. For instance, one could encode much more detailed information, such as numerous (\(n_d\sim 20\!-\!50\)) individual reflection peaks [80], but that is far too costly, in turn motivating recent research on how one might trade between the number of encoded peaks (\(n_d\)) and perceived authenticity [18].

Generally speaking, precomputed systems shift the trade-off from quality-versus-CPU as with runtime propagation simulation to quality-versus-storage (Sects. 3.8.1 and 3.8.2). This holds regardless of whether the precomputation is geometric (Steam Audio) or wave-based (Project Acoustics). Precomputation can introduce limitations such as slower artist turnaround times and static scenes, but in return significantly lowers the barrier to viability whenever the available CPU is severely restricted, which is the case for gaming applications or untethered VR platforms.

Wave simulation forces precomputation in today’s systems due to its high computational cost, but its advantage compared to geometric methods is that complex visual scene geometry is processed directly, without requiring any manual simplification. Further, arbitrary orders of diffraction around detailed geometry in general scenes (trees, buildings, chairs, etc.) are modeled, which avoids the risk of missing a salient path. In sum, one pays a high, fixed precomputation cost largely insensitive to scene complexity, and if that is feasible, obtains robust results directly from visual geometry at a low CPU cost.

As discussed in Sect. 3.6.2, parametric approaches enable intuitive controls for sound designers, which is of crucial importance in gaming applications, as we also saw in the design of the Wwise Spatial Audio system. In the case of PA, the parameters are looked up at each source-listener location pair at runtime (right of Fig. 3.4), and it becomes possible for the artist to specify dynamic aesthetic modifications of the physically-based baseline produced by simulation [44]. The sounds and modified acoustic parameters can then be sent to any efficient parametric reverberation and spatialization sub-system for rendering the binaural output.

9 Summary and Outlook

Creating an immersive and interactive sonic experience for virtual reality applications requires auralizing complex 3D scenes robustly and within tight real-time constraints. To meet these requirements, real-time systems follow a modular approach of dividing the problem into sound production, propagation, and spatialization. These can be mathematically formulated via the source directivity function, bidirectional impulse responses (BIR), and head-related transfer functions (HRTFs), respectively, leading to a general framework. Human auditory perception of acoustic responses deeply informs most systems, motivating optimizations such as the deterministic-statistical decomposition of the BIR.

We discussed many considerations that inform the design of practical systems. We illustrated with a few auralization systems how application requirements shape design choices, ranging from perceptual authenticity in architectural acoustics to game engines where believability, audio design, and CPU usage take central priority. With more development, one can hope for future auralization systems capable of scaling their quality-compute trade-offs to span all applications of VR auralization. Such a convergent evolution would be in line with current trends in visual rendering, where off-line photo-realistic rendering techniques and real-time game techniques are becoming increasingly unified [33].

Looking to the future, real-time auralization faces two major research challenges: scalability and scene dynamics. Game and VR scenes are trending toward completely open worlds where entire cities are modeled at once, spanning tens of kilometers, with numerous sound sources, and where very few assumptions can be made about the scene’s geometry or complexity. Similar considerations hold for engineering prediction of outdoor acoustics, such as noise levels in a city. We need real-time techniques that can scale to such challenging scenarios within CPU budgets, perhaps by analogy with level-of-detail techniques used in graphics. Scene dynamism is a related challenge. Many current game engines allow users to make global changes to immersive 3D worlds in real time. Dynamic techniques are required that can model, for instance, the diffraction loss around a just-created wall within tolerable latency. Progress in this direction has only just begun [35, 75, 83, 84].

The open challenge for the future is to build real-time auralization systems that can gracefully scale from plausible to accurate audio rendering for complex, dynamic, city-scale scenes depending on available computational resources. There is much to be done, and many undiscovered, foundational ideas remain.