1 Introduction

Over the past years, an increasing number of virtual and augmented reality (VR/AR) applications have emerged due to the advent of mobile devices such as smartphones and head-mounted displays. Audio plays an important role within these applications, one that is by no means restricted to conveying semantic information, for example, through dialogues or warning sounds. Beyond that, audio holds information about the spaciousness of a scene, including the location of sound sources and the reverberance or size of a virtual environment. In this way, audio can be regarded as a channel that provides semantic and spatial information while at the same time improving the sense of presence and immersion. Due to the key role of audio in VR/AR, this chapter gives an overview of methods for audio quality assessment in Sect. 5.2, followed by a brief introduction of audio reproduction techniques for VR/AR in Sect. 5.3. Readers who are familiar with audio reproduction techniques might skip Sect. 5.3 and directly continue with Sect. 5.4, which gives an overview of the quality of existing audio reproduction systems.

2 Perceptual Qualities and Their Measurement

Methods and systems for generating virtual and augmented environments can be understood as a special case of (interactive) audio reproduction systems. Thus, in principle, all procedures for the perceptual evaluation of audio systems can also be used for the evaluation of VR systems [6]. These include the procedures for the evaluation of “Basic Audio Quality”, which are standardized in various ITU recommendations and focus on the technical system properties and signal processing, as well as approaches with a wider focus on the listening situation and the presented audio content, taking into account the “Overall Listening Experience”. In addition, a number of measures have recently been proposed to more specifically determine the extent to which technologies for virtual and augmented environments live up to their claim of providing a convincing equivalent to physical acoustic reality. Finally, in addition to these holistic measures for evaluating VR and AR, there are a number of (VR-specific and VR-nonspecific) quality inventories that can be used to perform a differential diagnosis of VR systems, highlighting the individual strengths and weaknesses of the system and drawing conclusions for the targeted improvement.

2.1 Generic Measures

2.1.1 Basic Audio Quality

Since the mid-1990s, the Radiocommunication Sector of the International Telecommunication Union (ITU-R) has developed a series of recommendations for the “Subjective assessment of sound quality”. The series includes an overview of the areas of application of the recommendations with instructions for the selection of the appropriate standard [35] as well as an overview of “general methods” which are applied slightly differently in the different standards [36]. They contain instructions for experimental design, selection of the listening panel, test paradigms and scales, reproduction devices, and listening conditions, up to the statistical treatment of collected data. Originally, these recommendations were mainly used for the perceptual evaluation of audio codecs, but later, they were also used for the evaluation of multi-channel reproduction systems and 3D audio techniques. The central construct to be evaluated by all ITU procedures is “Basic Audio Quality” (BAQ). It can be evaluated either by direct scaling or by rating the “impairment” relative to an explicit or implicit reference, caused by deficits of the transmission system such as a low-bitrate audio codec or by limitations of the spatial reproduction. By definition, BAQ includes “all aspects of the sound quality being assessed”, such as “timbre, transparency, stereophonic imaging, spatial presentation, reverberance, echoes, harmonic distortions, quantisation noise, pops, clicks and background noise” [36, p. 7]. In studies of impairment, listeners are asked “to judge any and all detected differences between the reference and the object” [34, p. 7]. In this case, the evaluation of BAQ thus corresponds to a rating of general “similarity” or “difference”.

The most popular standards for BAQ are (cf. Fig. 5.1)

Fig. 5.1

User interfaces for ABC/HR and MUSHRA tests. Active conditions are indicated by orange buttons; loop range and current playback position by orange boxes and lines. The ABC/HR interface shows only one condition but versions with multiple conditions per rating screen are also possible. If multiple conditions are displayed on a single screen, an additional button to sort the conditions according to the current ratings might help subjects to establish more reliable ratings (CC-BY, Fabian Brinkmann)

  • ITU-R BS.1116-3:2016 (Methods for the subjective assessment of small impairments in audio systems) [34]. Listeners are asked to rate the difference between an audio stimulus and a given reference stimulus using a continuous scale with five labels (“Imperceptible”/“Perceptible, but not annoying”/“Slightly annoying”/“Annoying”/“Very annoying”) used as “anchors”. Participants are presented with three stimuli (A, B, C). A is the reference, and B and C are rated, with one of the two stimuli again being the hidden reference (double-blind triple-stimulus with hidden reference).

  • ITU-R BS.1534 (Method for the subjective assessment of intermediate quality level of audio systems) [37]. Unlike ITU-R BS.1116-3, it is a multi-stimulus test where direct comparisons between the different stimuli are possible. Quality is rated on a continuous scale with five labels (“Excellent”/“Good”/“Fair”/“Poor”/“Bad”). Participants are presented with a reference, no more than nine stimuli under test, and two anchor signals (MUlti-Stimulus test with Hidden Reference and Anchor, MUSHRA). The standard anchors are a low-pass filtered version of the original signal with a cut-off frequency of 3.5 kHz (low-quality anchor) and 7 kHz (mid-quality anchor). Alternatively or additionally, further non-standard anchors can be used; they should resemble the character of the artifacts of the systems being tested and indicate how the systems under test compare to well-known audio quality levels. Possible anchors in the context of spatial audio might be conventional mono/stereo recordings or non-individual signals. Since listeners can directly compare the signals under test with the reference and among each other, more reliable ratings can be expected in situations where stimuli differ significantly from the reference, but only slightly from each other.

Although BAQ is the standard attribute to be tested in both ITU recommendations, other attributes are suggested to test more specific aspects of audio systems such as spatial and timbral qualities. ITU-R BS.1284-2 contains a list of main attributes and sub-attributes, from which one can choose those suitable for a particular test [36, Attachment 1]. In this respect, both recommendations are often used only as an experimental paradigm, but applied to qualities other than BAQ, e.g., those developed in various taxonomies on the properties of VR systems (see Sect. 5.2.2.4).

A number of issues have been raised addressing specific aspects of the ITU recommendations [55]. One pertains to the scale labels being multidimensional, which could distort the ratings. This can be avoided by using clearly unidimensional labels at both ends, e.g., “imperceptible”/“very perceptible” for ABC/HR or “good”/“bad” for MUSHRA, and additional unlabeled lines for orientation. Another issue points out that data from MUSHRA tests often violate the assumptions for conducting an Analysis of Variance (ANOVA), the most common means for statistical analysis of the results. This can be addressed by using general linear models for the analysis, which are more flexible than ANOVA and place fewer requirements on the input data [33].
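As an illustration of the latter point, the following minimal sketch analyzes MUSHRA ratings with a linear mixed-effects model (one flexible member of the general linear model family) in Python using statsmodels, rather than with a repeated-measures ANOVA. The file and column names ("mushra_ratings.csv", "rating", "condition", "item", "subject") are hypothetical placeholders, and the model structure is only one of several reasonable choices.

```python
# Minimal sketch: analyzing MUSHRA ratings with a linear mixed model
# instead of a repeated-measures ANOVA. Column names are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

# long-format table: one row per subject x audio item x test condition
data = pd.read_csv("mushra_ratings.csv")

# fixed effects for the systems under test and the audio items,
# random intercept per subject to absorb individual scale usage
model = smf.mixedlm("rating ~ C(condition) + C(item)", data,
                    groups=data["subject"])
result = model.fit()
print(result.summary())
```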

2.1.2 Overall Listening Experience

The construct of “Overall Listening Experience” (OLE) [70] was derived from the concept of “Quality of Experience”, which in the context of quality management describes “the degree of delight or annoyance of the user of an application or service” [11], considering not only the technical performance of a system but also the expectations, personality, and current state of the user as influencing factors. In contrast to listening tests according to the ITU recommendations, the musical content is thus explicitly part of the judgment that listeners make about the OLE.

A measurement of the OLE can be a useful alternative or supplement to purely system-related evaluations insofar as, for example, the difference between different playback systems for music may very well be audible in a direct comparison, yet hardly relevant for everyday music consumption, particularly compared to the liking of the music played. In this respect, an evaluation according to ITU may convey a false picture of the general relevance of technical functions. This becomes evident, for example, in a direct comparison between BAQ and OLE ratings of spatial audio systems, where the differences between BAQ ratings are generally larger than between OLE ratings. In a listening test, both BAQ ratings according to ITU-R BS.1534 with explicit reference and OLE ratings (“Please rate for each audio excerpt how much you enjoyed listening to it”) without explicit reference were collected for three different spatial audio systems (2.0 stereo, 5.0 surround, 22.2 surround [71]). While the difference between 2.0 and 5.0 was equally visible in BAQ and OLE, the difference between 5.0 and 22.2 was clearly audible in a direct comparison (BAQ), but obviously did not result in a significant increase in listening pleasure (OLE, Fig. 5.2).

Fig. 5.2

Results of a listening test (z-standardized scores) of Basic Audio Quality (BAQ) and Overall Listening Experience (OLE) for three different spatial audio systems (2.0 stereo, 5.0 surround, 22.2 sound referred to as “3D Audio”). BAQ ratings were given according to ITU-R BS.1534 relative to the “3D audio” condition as an explicit reference, whereas OLE ratings were given without a reference stimulus [71, p. 84]

2.2 VR/AR-Specific Measures

2.2.1 Authenticity

A simulation that is indistinguishable from the physical sound field it is intended to simulate could be termed authentic. The term could be used in a physical sense; then it would aim at the identity of sound fields, be it the identity of sound pressures in the ear canal (binaural technology) or the identity of sound fields in an extended spatial area (sound field synthesis). Since no technical system is currently able to guarantee such an identity, and since such a physical identity may also not be required for the users of VR/AR systems, the term authenticity is mostly used in the psychological sense. In this sense, it denotes a simulation that is perceptually indistinguishable from the corresponding real sound field [8].

The challenge in determining perceptual authenticity is not to let the presence of a simulation or the physical reference in the listening test become recognizable solely through the environment of the presentation, i.e., by wearing headphones as opposed to listening freely in the physical sound field, or by listening in a studio environment that does not even visually correspond to the simulated space. For this reason, a determination of the authenticity of loudspeaker-based systems such as Wave Field Synthesis (WFS) or Higher-Order Ambisonics (HOA) can hardly be carried out in practice: even if one were to suppress the visual impression by means of a blindfold, the listener would have to be brought from the playback room of the synthesis into the real reference room, which would no longer allow a direct comparison due to the temporal delay. Setting up the sound field synthesis in the corresponding physical room, on the other hand, is not an option, since the room acoustics of the physical room would influence the sound field of the loudspeaker synthesis.

Fig. 5.3

Listening test setup for testing authenticity and plausibility. For seamless switching between audio from the loudspeakers and their binaural simulation, the subject is wearing extra-aural headphones that minimize distortions of exterior sound fields. The head position of the subject is tracked by an electromagnetic sensor pair mounted on the top of the chair and headphones. See also Sect. 5.4.1.1 (CC-BY, Fabian Brinkmann)

A determination of authenticity is simpler for binaural technology systems. By using open headphones that are largely transparent to the external sound field and whose influence can possibly be compensated by an equalization filter, a direct comparison can be made by switching back and forth between a physical sound source and its binaural simulation [8]. The influence of the headphones on the external sound field can be further minimized by using extra-aural headphones suspended a few centimeters in front of the ear [18]. Such an influence can also come from other VR devices such as head-mounted displays that are close to the ear canal [27]. An example of a listening test setup is shown in Fig. 5.3.

As a paradigm for the listening test, classical procedures such as ABX with explicit reference [12, 44] or forced-choice procedures (N-AFC) with non-explicit reference [21] can be used, which have proven suitable for detecting small differences between two stimuli. It should be noted that, especially in the case of minor differences, the presentation mode can have a great influence on the recognition rate, such as whether the two stimuli (simulation and reference) can be heard by the test participants only once or as often as desired [8, p. 1793 f]. An example of a user interface is given in Fig. 5.4.

Binaural representations can also be used to compare physical sound fields with simulations based on loudspeaker arrays [85]. For this purpose, the measured or numerically simulated sound field of a loudspeaker array at a given listening position can be presented in the listening test as a binaural synthesis, thus avoiding the problems described above when comparing physical and loudspeaker-synthesized sound fields. It should be noted, however, that in this case a simulation (binaural synthesis) of a simulation (sound field synthesis) is auralized, so it may be difficult to separate the artifacts of the two methods.

Fig. 5.4

User interfaces for testing authenticity with an ABX test (also termed 2-interval/2-alternative forced choice, 2i/2AFC) and testing plausibility with a yes/no paradigm. Responses/active conditions are indicated by orange buttons; loop range and current playback position by orange boxes and lines. In case of the test for plausibility, the audio starts automatically and can only be heard once (CC-BY, Fabian Brinkmann)

2.2.2 Plausibility

While the authenticity of virtual environments can be determined by the (physical or perceptual) identity of physical and simulated sound fields, plausibility has been proposed as a measure of the extent to which a simulation is “in agreement with the listener’s expectation towards a corresponding real event” [47]. Plausibility thus does not address the comparison with an external, presented reference, but the consideration against the background of an inner reference that reflects the credibility of the simulation, based on the listener’s experience and expectations of the internal structure of acoustic scenes or environments. The operationalization of this construct thus does not require a comparative evaluation, but a yes–no decision.

By analyzing such yes–no decisions with the statistical framework of signal detection theory (SDT, [84]), one can separate the response bias, i.e., a general, subjective tendency to consider stimuli as “real” or “simulated”, from the actual impairments of the simulation. Signal detection theory is originally a method for determining threshold values. For example, the absolute hearing threshold of sounds can be determined by the statistical analysis of a 2x2 contingency table in which two correct answers (sound present and heard, sound absent and not heard, i.e., hits and correct rejections) and two incorrect answers (sound present and not heard, sound absent and heard, i.e., misses and false alarms) occur. By contrasting these response frequencies, the response bias, i.e., a general tendency to mark sounds as “heard,” can be separated from actual recognition performance. The latter is represented by the sensitivity \(d{'}\) which can be converted to a corresponding 2AFC detection rate. A number of at least 100 yes–no decisions per subject is considered necessary for obtaining stable individual SDT parameters [40].
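The computation behind such an analysis is compact. The following minimal sketch derives the sensitivity \(d{'}\), the response bias (criterion), and the equivalent 2AFC detection rate from a yes–no contingency table in Python; the trial counts are made-up example values, and the 1/(2n) correction for extreme rates is just one common convention.

```python
# Minimal sketch: sensitivity d' and response bias c from yes/no data,
# plus the corresponding 2AFC detection rate. Counts are made up.
import numpy as np
from scipy.stats import norm

hits, misses = 42, 8                 # "simulated" trials answered "simulated"
false_alarms, correct_rej = 12, 38   # "real" trials answered "simulated"

def rate(k, n):
    # avoid z(0) and z(1) with the common 1/(2n) correction
    return np.clip(k / n, 1 / (2 * n), 1 - 1 / (2 * n))

h = rate(hits, hits + misses)
f = rate(false_alarms, false_alarms + correct_rej)

d_prime = norm.ppf(h) - norm.ppf(f)             # sensitivity
criterion = -0.5 * (norm.ppf(h) + norm.ppf(f))  # response bias
p_2afc = norm.cdf(d_prime / np.sqrt(2))         # equivalent 2AFC rate

print(f"d' = {d_prime:.2f}, c = {criterion:.2f}, 2AFC = {p_2afc:.1%}")
```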

This approach can be applied to the evaluation of virtual realities, in that the artifacts caused by deficits in the simulation take on the role of a stimulus to be discovered, and listeners are asked to identify the environment as “simulated” if they notice them. The prerequisite for such an experiment is, however, that—similar to an experiment on “authenticity”—one can present both physically “real” and simulated sound fields without the nature of the stimulus already being recognizable on the basis of the experimental environment, for example, by providing a visual representation of the physical sound source also in the simulated case, or by conducting the experiment with closed or blindfolded eyes.

2.2.3 Sense of Presence and Immersion

A central function of VR systems is to create a “sense of presence”, i.e., the feeling of being or acting in a place, even when one is physically situated in another location and the sensory input is known to be technically mediated. The concept of presence, also called “telepresence” in older literature in reference to teleoperation systems used to manipulate remote physical objects [58], has given rise to its own research direction and community in the form of presence research, which is organized in societies such as the International Society for Presence Research (ISPR) and conferences such as the biennial PRESENCE conference.

To measure the degree of presence, different questionnaires have been developed; for an overview see [72]. The instrument of Witmer and Singer [87], one of the most widely used questionnaires, contains 28 questions such as “How much were you able to control events?”, “How responsive was the environment to actions that you initiated (or performed)?”, “How natural did your interactions with the environment seem?”, or “How completely were all of your senses engaged?”. Analyzing the response patterns in these questionnaires, different dimensions such as “Involvement”, “Sensory fidelity”, “Adaptation/immersion”, and “Interface quality” have emerged in factor analytic studies [86].

Other approaches to measuring presence include behavioral measurements. If one assumes that presence is given if the reactions to a virtual environment correspond to the behavior in physical environments, then for example, the swaying caused by a moving visual scene or ducking in response to a flying object can be used as an indicator for the degree of presence [19]. As a prerequisite for such realistic behavior, Slater considers two aspects: The sensation of being in a real place (“place illusion”) and the illusion that the scenario being depicted is actually occurring (“plausibility illusion”) [75]. Note, however, that “plausibility” is used here, in comparison with the understanding used in Sect. 5.2.2.2, in a narrower sense with a slightly different meaning.

A similar idea is behind the use of psychophysiological measures. If the normal physiological response of a person to a particular situation is replicated in a VR environment, this can be considered as an indicator of presence. Physiological parameters have been used to evaluate various functions and applications of VR systems [28], and several studies have also used them to measure presence. Depending on the scenario presented, the electroencephalogram (EEG) [5], heart rate (HR) [14], or skin conductance and heart rate variability [13] were shown to be indicators of different degrees of presence. The exact correlations, however, seem to depend very much on the scenario presented in each case, and in any case, comparative values from a corresponding real-life stimulus are required to calibrate the measurement. Breaks in presence (BIPs), i.e., moments where users become aware of the mediated nature of the VR experience because shortcomings of the system suddenly become obvious, also seem to be associated with physiological responses [76].

In general, these approaches seem to be limited to situations in which physiological reactions are sufficiently pronounced, such as anger, fear, or stress [54], whereas reactions are less pronounced when the person is predominantly an observer of a scene that has little emotional impact. This may be the reason why manipulations to the level of presence in these studies were almost exclusively realized through changes to the visual display and user interaction, while physiological parameters were hardly used to evaluate the degree of presence in acoustic virtual environments.

The sense of presence, long used solely as a measure for evaluating VR and AR systems, has recently gained increasing attention as a general neuropsychological phenomenon evolving from biological as well as cultural factors [68]. From the perspective of evolutionary psychology, the sense of presence has evolved not to distinguish between real and virtual conditions, but to distinguish the external world from phenomena attributable to one’s own body and mind. On such a theoretical basis, it follows that achieving a high degree of presence requires not only sensory plausibility and naturalness of interaction but also that the scene is meaningful and relevant to the respective user. The degree of presence in a virtual scene will remain limited if the content is irrelevant to the respective user [66].

Related to the sense of presence, but less consistently used, is the concept of “immersion”. In some literature, it is treated as an objective property of VR and AR systems [77]. According to this technical understanding, a 5-channel system is considered more “immersive” than a two-channel system, simply because it is able to present a wider range of sound incidence directions to the listener. In other works, however, immersion is treated as a psychological construct, i.e., a human response to a technical system [87], shifting the meaning of “immersion” closer to the concept of presence [74]. Finally, in many works, especially in the field of audio, it remains unclear whether the reasoning about immersion is on a technical or psychological level. Chapter 11 discusses this issue in more depth, focusing on audiovisual experiences.

2.2.4 Attributes and Taxonomies

With properties such as authenticity, plausibility, or the sense of presence, a global assessment of VR systems is intended. In order to obtain indications of the strengths and weaknesses of these systems and to draw appropriate conclusions for improvement, however, a differential diagnosis is required that separately assesses different qualities of the respective systems. To distinguish these perceptual qualities from technical parameters of the system that may have an influence on them, the former are also referred to as “quality features” and the latter as “quality elements” in the context of product-sound quality [38].

For this purpose, different taxonomies for the qualities of virtual acoustic environments, 3D audio, or spatial audio systems have been developed. Some of these are based on earlier collections of attributes for sound quality and spatial audio quality [42], which were clustered into sound families using semantic analyses such as free categorization or multidimensional scaling (MDS) [43]. Pedersen and Zacharov (2015) [62] developed a sound wheel to present such a lexicon for reproduced sound. The wheel format has a longer tradition in the domain of food quality and sensory evaluation [60] as a structured and hierarchical form of a lexicon of different sensory characteristics. The selection of the items and the structure of the wheel in [62] are based on empirical methods such as hierarchical cluster analysis and measures for discrimination, reliability, and inter-rater agreement of the individual items.

Fig. 5.5

SAQI wheel for the evaluation of virtual acoustic environments, structured into informal categories (inner ring) and attributes (outer ring). For definitions and sound examples refer to depositonce.tu-berlin.de/handle/11303/157.2 (CC-BY, Fabian Brinkmann)

While the taxonomies mentioned above were developed for spatial audio systems and product categories such as headphones, loudspeakers, and multi-channel sound in general, others were generated with a stronger focus on virtual acoustic environments. Developed by qualitative methods such as expert surveys (DELPHI method [73]) and expert focus groups [48], they contain between 7 [73] and 48 attributes [48], from which those relevant to the specific experiment can be selected. Examples of a VR/AR-specific taxonomy and a rating interface are shown in Figs. 5.5 and 5.6.

Fig. 5.6

User interface for conducting a SAQI test. The interface is similar to that of a MUSHRA test shown in Fig. 5.1 with the difference that the current quality to be rated is given together with the possibility to show its definition (info button) and that the rating scale can also be bipolar. In any case, zero ratings indicate no perceivable difference (CC-BY, Fabian Brinkmann)

2.3 VR/AR-Specific User Interfaces, Test Procedures, and Toolkits

While the quality measures introduced so far can theoretically be directly transferred for testing in VR and AR, there are specific features that should be addressed: The test method and interface, the technical administration of the test, and the effect of added degrees of freedom on the subjects.

First, most of the test methods and user interfaces were developed to be accessed on a computer with a mouse as a pointing and clicking device. The rating procedure and the elements on the user interface might thus not be optimal for testing in VR/AR. This might be less relevant for simple paradigms such as ABX or yes/no tests but can certainly become an issue for rating the quality of multiple test conditions.

Fig. 5.7

Interface of the Drag and Drop MUSHRA after [81]. The currently playing condition is indicated by the orange button; the loop range and playback position by the orange box and line (CC-BY, Fabian Brinkmann)

Two approaches were suggested to account for this. Völker et al. [81] suggested a modified MUSHRA to simplify the rating interface and make it easier to establish an order between test conditions, especially if many test conditions are to be compared against the reference and each other (cf. Fig. 5.7). The idea is to unify playback and rating by making use of drag and drop actions, where the playback is triggered when the subject drags a button corresponding to a test condition, and the rating is achieved by dropping the button on a two-dimensional scale. Ratings obtained with the modified interface were comparable to those obtained with the classic interface in terms of test–retest reliability and discrimination ability. At the same time, the modified interface was preferred by the subjects, and subjects needed less time to complete the rating task. Note that the Drag and Drop MUSHRA could be easily adapted for testing quality taxonomies introduced in Sect. 5.2.2.4.

Fig. 5.8

Interface of the elimination task after [67]. The currently playing condition is indicated by the orange button (CC-BY, Fabian Brinkmann)

A VR/AR-tailored approach to further simplify the rating procedure and interface was suggested by Rummukainen et al. [67]. They designed a simple and easy-to-operate interface, in which the subject eliminates the conditions one after another in the order from worst to best (cf. Fig. 5.8). The elimination establishes a rank order of the stimuli, from which interval-scaled values—similar to Basic Audio Quality ratings—can be obtained by fitting Plackett–Luce models to the ranking vectors. As with the Drag and Drop MUSHRA, the elimination task could be adapted for testing against a reference and using taxonomies.
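The following minimal sketch illustrates how such interval-like scale values could be derived from elimination orders by fitting a Plackett–Luce model with a simple minorization–maximization (MM) algorithm; the ranking data are hypothetical, and this is only one of several ways to estimate the model (the original study may have used a different implementation).

```python
# Minimal sketch: turning elimination orders into interval-like scale
# values by fitting a Plackett-Luce model with a simple MM algorithm
# (after Hunter, 2004). Rankings are hypothetical, best condition first.
import numpy as np

def plackett_luce(rankings, n_items, n_iter=200):
    """rankings: list of lists, each a full ranking of item indices,
    best first. Returns a strength ("worth") per item."""
    w = np.ones(n_items)
    # number of stages each item wins (appears at any non-last position)
    wins = np.zeros(n_items)
    for r in rankings:
        for item in r[:-1]:
            wins[item] += 1
    for _ in range(n_iter):
        denom = np.zeros(n_items)
        for r in rankings:
            remaining = np.array(r)
            for j in range(len(r) - 1):
                s = w[remaining[j:]].sum()
                denom[remaining[j:]] += 1.0 / s
        w = wins / denom
        w /= w.sum()  # fix the scale (worths are defined up to a factor)
    return w

# four conditions ranked by three subjects (indices 0..3, best first)
rankings = [[2, 0, 3, 1], [2, 3, 1, 0], [0, 2, 3, 1]]
worth = plackett_luce(rankings, n_items=4)
print(np.log(worth))  # log-worths can serve as interval-scaled scores
```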

Classic tests of Basic Audio Quality are most often conducted for (static) audio-only conditions and a variety of software solutions is available to conduct such tests [6, Sect. 9.2.3]. In contrast, tests in VR/AR require the experimental control of complex audiovisual scenes. In addition, the display of rating interfaces might affect the Quality of Experience (QoE) of interactive environments due to their potentially negative effect on the perceived presence [65]. An emerging tool to account for these aspects of AR/VR is the Quality of Experience Evaluation Tool (Q.ExE) currently developed by Raake et al. [65].

A third VR/AR-specific aspect is the possibility of freely exploring an audiovisual scene in six degrees of freedom (6DoF). Introducing 6DoF clearly affects the rating behavior of subjects [67] and might thus be considered problematic at first glance. An unrestricted 6DoF exploration is, however, the most realistic test condition. While this might introduce additional variance in the results, it might also be argued that results are more comprehensive and reflect more aspects of the audiovisual scene due to free exploration. Whether or not the exploration should be restricted will thus ultimately depend on the aim of an investigation.

3 Audio Reproduction Techniques

Two fundamentally different paradigms can be distinguished in audio reproduction for VR/AR, which can be illustrated with the help of Fig. 5.9. The figure shows the simple sound field of a point source being reflected by an infinite wall.

The first paradigm is to reproduce the entire sound field in a controlled zone, which has two advantages. First, multiple listeners can freely explore the sound field at the same time, and second, the reproduction is already individual as every listener naturally perceives the sound through their own ears. However, there are three disadvantages. First, reproducing the entire sound field requires tens or hundreds of loudspeakers depending on the reproduction algorithm and the size of the listening area. Second, it requires an acoustically treated environment to avoid detrimental effects due to reflections from the reproduction room itself. Third, it is often challenging to achieve a correct reproduction covering the entire hearing range from approximately 20 Hz to 20 kHz. In the following, this reproduction paradigm will be referred to as sound field synthesis (SFS).

The second paradigm is to only reproduce the sound field at the listeners’ ears. The three advantages of this approach are that it can be realized with a single pair of headphones or loudspeakers, that at least headphone-based reproduction does not pose any demands on the reproduction room, and that a broad frequency range can be correctly reproduced. In turn, two disadvantages arise. First, the position and head orientation of the listeners must be tracked to enable a free exploration of the sound field. Second, the individualization of the ear signals is challenging. Often, the reproduced signals stem from a dummy head, which can cause artifacts such as coloration and increased localization errors in case the ears, head, and torso of the listener differ from the dummy head. This reproduction paradigm will be referred to as binaural synthesis in the following.

It is interesting to see that the advantages and disadvantages of the two paradigms are exactly complementary, which creates a strong bond between application and reproduction paradigm: whereas binaural synthesis is the apparent option for any application on mobile devices, sound field synthesis is appealing for public or open spaces such as artistic performances and public address systems. The next sections will introduce the two paradigms in more detail. We focus on technical aspects but start with brief theoretical introductions to foster a better understanding of the subject as a whole.

Fig. 5.9

Sound field of a point source reflected by an infinite wall. The direct and reflected sound fields are shown as red and blue circles and the direct and reflected sound paths to the listener as red and blue dashed lines. The image of the head in gray denotes the listening position. (CC-BY, Fabian Brinkmann)

3.1 Sound Field Analysis and Synthesis

The idea behind sound field analysis and synthesis (SFA/SFS) is to reproduce a desired sound field within a defined listening area using a loudspeaker array. The example in Fig. 5.10 shows this for the simple case of a plane wave traveling in the normal direction of a linear array.

Two fundamentally different SFA/SFS approaches can be distinguished. Physically motivated algorithms aim at capturing and reproducing sound fields physically correct, while perceptually motivated methods aim at capturing and synthesizing sound field properties that are deemed to be of high perceptual relevance.

Fig. 5.10

Sound field synthesis of a plane wave traveling from bottom to top (thick red lines) by a linear point source array (blue points and thin blue semi-circles) flush-mounted into a sound hard wall (gray line) (CC-BY, Fabian Brinkmann)

3.1.1 Sound Field Acquisition and Analysis

Sound field synthesis requires a sound field to be reproduced, and there are two options for its acquisition: measurement or simulation. Measured sound fields can have a high degree of realism and can, for example, be used for broadcasting concerts, while simulated sound fields offer more flexibility in the design of the auditory scene and are thus often used in game audio engines (please refer to Chap. 3 for an introduction to interactive auralization). The description and evaluation of sound field simulation techniques is beyond the scope of this chapter, and the interested reader is kindly referred to related review articles [10, 79].

Sound fields are usually measured with microphone arrays, i.e., spatially distributed microphones that are in most cases positioned on the surface of a rigid or imaginary sphere. They can be used to directly record sound scenes such as concerts. In some cases, however, a direct recording will be limiting as it does not allow the audio content to be changed once the recording is finished. This becomes possible if so-called spatial room impulse responses (SRIRs) are measured, i.e., impulse responses that describe the sound propagation between sound sources and each microphone of the array.

A common method for physically motivated SFA is the plane wave decomposition (PWD), which applies Fourier transforms with respect to time and space to the acquired sound field [64, Chap. 2]. It derives a spatially continuous description of the analyzed sound field containing information on the times and directions of arriving plane waves. If the analyzing array has sufficiently many microphones, PWD can yield a physically correct and complete description of the sound field.

Popular approaches for perceptually motivated SFA are spatial impulse response rendering (SIRR), directional audio coding (DirAC), and the spatial decomposition method (SDM) [64, 78, Chaps. 4–6]. These approaches use a time–frequency analysis to extract the direction of arrival and, in the case of SIRR and DirAC, also the residual diffuseness for each time–frequency slot. The intention is to extract this information from signals recorded with only a few microphones—typically between 4 and 16—and reproduce the signals with an increased resolution using methods introduced in the following sections. SIRR and SDM only work with SRIRs, while PWD and DirAC also work with direct recordings. While SDM uses a broadband frequency analysis and extremely short time windows, the remaining methods use perceptually motivated time and frequency resolutions. SDM is able to extract a single prominent reflection per time window, while PWD and higher-order realizations of SIRR and DirAC can detect multiple reflections in each time–frequency slot.

3.1.2 Physically Motivated Sound Field Reproduction

The two methods for physically motivated sound field reproduction are wave field synthesis (WFS, works with linear, planar, rectangular, and cubic loudspeaker arrays) and near-field compensated higher order Ambisonics (NFC-HOA, works with circular and spherical arrays) [1]. Both methods can reproduce plane waves and point sources by filtering and delaying the sounds for each loudspeaker in the array. In the simple case shown in Fig. 5.10, all loudspeakers play identical signals. Because of their high computational demand, WFS and NFC-HOA are rarely used with measured sound fields that consist of hundreds of sources/waves. One possible approach is to use only a few point sources for the direct sound and early reflections, and a small number of plane waves for the reverberation.

3.1.3 Perceptually Motivated Sound Field Reproduction

The most common methods for perceptually motivated sound field reproduction are vector-based amplitude panning (VBAP), multiple direction amplitude panning (MDAP), and Ambisonics panning, which aim at reproducing point-like sources [89, Chaps. 1, 3, and 4]. VBAP is an extension of stereo panning to arbitrary loudspeaker array geometries. It uses the one to three speakers that are closest to the position of the virtual source to create a phantom source. MDAP creates a discrete ring of phantom sources—each realized using VBAP—around the position of the virtual source, so that the perceived source width becomes almost independent of the position of the virtual source. Ambisonics panning can be thought of as a beamformer that uses all loudspeakers of the array simultaneously to excite circular or spherical sound field modes. In this case, the position of the virtual source is given by the direction of the beam. Similar to MDAP, Ambisonics yields virtual sources with an almost position-independent perceived width. In all cases, the degree to which the width of the sources can be controlled increases with the number of loudspeakers.
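As an illustration of the basic principle, the following sketch computes VBAP gains for a single loudspeaker triplet: the gains are the coordinates of the virtual source direction in the basis spanned by the three loudspeaker direction vectors, normalized to constant energy. The loudspeaker layout and source direction are made up, and the selection of the active triplet from a full array is omitted.

```python
# Minimal sketch of VBAP gain computation for one loudspeaker triplet
# (after Pulkki's formulation). Layout and source direction are made up.
import numpy as np

def vbap_gains(source_dir, speaker_dirs):
    """source_dir: unit vector (3,); speaker_dirs: (3, 3) with rows being
    the unit direction vectors of the active loudspeaker triplet."""
    g = np.linalg.solve(speaker_dirs.T, source_dir)   # solve L^T g = p
    if np.any(g < 0):
        raise ValueError("source lies outside this triplet")
    return g / np.linalg.norm(g)                      # constant energy

def sph2cart(azimuth_deg, elevation_deg):
    az, el = np.radians([azimuth_deg, elevation_deg])
    return np.array([np.cos(el) * np.cos(az),
                     np.cos(el) * np.sin(az),
                     np.sin(el)])

# hypothetical triplet at +/-30 deg azimuth and one speaker at 45 deg elevation
speakers = np.stack([sph2cart(30, 0), sph2cart(-30, 0), sph2cart(0, 45)])
print(vbap_gains(sph2cart(10, 15), speakers))
```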

In many applications, these methods are used as a means to reproduce sound fields that were analyzed using SIRR, SDM, and DirAC. Two reasons for this are their computational efficiency and the fact that they are relatively robust against irregular loudspeaker arrays (non-spherical, missing speakers), which are advantages over physically motivated approaches. VBAP and MDAP are robust to irregular arrays by design (they do not pose any demands on the array geometry). This is not generally true for Ambisonics panning; however, the state-of-the-art All-Round Ambisonics Decoder (AllRAD, [89, Sect. 4.9.6]), which combines VBAP and Ambisonics panning, can handle irregular arrays well.

Fig. 5.11

Example of a headphone-based pipeline for binaural synthesis. Dashed lines indicate acoustic signals; black lines indicate digital signals; gray lines indicate movements in 6DoF. \(H_c\) denote compensation filters for the recording (yellow) and reproduction equipment (red, CC-BY, Fabian Brinkmann)

3.2 Binaural Synthesis

The fundamental theorem of binaural technology is that recording and reproducing the sound pressure signals at a listener’s ears will evoke the same auditory perception as if the listener was exposed to the actual sound field. This is because all acoustic cues that the human auditory system exploits for spatial hearing are contained in the ear signals. These cues are interaural time and level differences (ITD, ILD), spectral cues (SC), and environmental cues. ITD and ILD stem from the spatial separation of the ears and the acoustic shadow of the head and make it possible to perceive the position of a source in the lateral dimension (left/right). Spectral cues originate from direction-dependent filtering of the outer ear and enable us to perceive the source position in the polar dimension (up/down). The most prominent environmental cue might be reverberation from which information about the source distance and the size of a room can be extracted. For more information please refer to Blauert  [7] and to Chap. 4 of this volume.
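To make the first two cues concrete, the following toy sketch estimates a broadband ITD (via the lag of the interaural cross-correlation) and ILD (via the RMS level difference) from a pair of ear signals; the signals are synthetic, and the sign of the ITD simply follows scipy's lag convention.

```python
# Minimal sketch: estimating the two interaural cues from a binaural
# signal; left/right are hypothetical ear signals sampled at fs.
import numpy as np
from scipy.signal import correlate, correlation_lags

def interaural_cues(left, right, fs):
    # ITD: lag of the maximum of the interaural cross-correlation
    # (negative lag here means the right ear signal arrives later)
    xcorr = correlate(left, right, mode="full")
    lags = correlation_lags(len(left), len(right), mode="full")
    itd = lags[np.argmax(xcorr)] / fs                  # seconds
    # ILD: broadband level difference in dB
    ild = 20 * np.log10(np.sqrt(np.mean(left**2)) /
                        np.sqrt(np.mean(right**2)))
    return itd, ild

fs = 44100
# toy example: right ear delayed by 0.5 ms and attenuated by 6 dB
t = np.arange(fs) / fs
left = np.sin(2 * np.pi * 500 * t)
right = 0.5 * np.sin(2 * np.pi * 500 * (t - 0.0005))
print(interaural_cues(left, right, fs))
```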

An example of a binaural processing pipeline with headphone reproduction is shown in Fig. 5.11. The processed binaural signals are stored or directly streamed to the listener, whereby the signals are selected and/or processed according to the current position and head orientation of the listener. In any case, a physically correct simulation requires compensating the recording and reproduction equipment (loudspeakers, microphones, headphones) to assure an unaltered reproduction of the binaural signals. These compensation filters are usually separated for signal acquisition and reproduction to maximize the flexibility of the pipeline. For the same reason, anechoic or dry audio content is often convolved with acquired binaural impulse responses, which makes it possible to change the audio content without changing the stored binaural impulse responses. The next sections detail the blocks of the introduced reproduction pipeline one by one.
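The core of such a pipeline—convolving dry audio content with a pair of binaural impulse responses—can be sketched in a few lines. The file names and the assumption of a mono source signal are hypothetical; real-time systems would additionally apply the compensation filters and exchange the impulse responses according to the head tracking data.

```python
# Minimal sketch of static binaural rendering: convolving dry (anechoic)
# audio with a pair of binaural impulse responses. File names are
# hypothetical placeholders.
import numpy as np
import soundfile as sf
from scipy.signal import fftconvolve

audio, fs = sf.read("dry_speech.wav")          # mono, shape (N,)
brir, fs_brir = sf.read("brir_front.wav")      # stereo, shape (M, 2)
assert fs == fs_brir

left = fftconvolve(audio, brir[:, 0])
right = fftconvolve(audio, brir[:, 1])
binaural = np.stack([left, right], axis=1)
binaural /= np.max(np.abs(binaural))           # normalize to avoid clipping
sf.write("binaural_out.wav", binaural, fs)
```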

Fig. 5.12

HRIR measurement system at the Technical University of Berlin with details of the positioning procedure using cross-line lasers. During the measurement, the subjects wear in-ear microphones, sit on the chair in the center of the loudspeaker array, and are continuously rotated to measure a full spherical HRIR data set. In addition, the wire frames on the floor are covered with absorbing material (CC-BY, Fabian Brinkmann)

3.2.1 Signal Acquisition and Processing

The most basic technique is to directly record sound events—for example a concert—with a dummy head, i.e., a replica of a human head (and torso) that is equipped with microphones at the positions of the ear canal entrances or inside artificial ear canals. This requires a straightforward compensation of the recording microphones by means of an inverse filter, whereas the sources are considered to be a part of the scene and thus remain uncompensated. This approach is, however, very inflexible because the position and orientation of the listener and sources cannot be changed during reproduction. It is thus more common to measure or simulate spherical sets of head-related impulse responses (HRIRs) that describe the sound propagation between a free-field sound source and the listener’s ears (cf. [88, Chaps. 2 and 4] and Fig. 5.12). In this case, the sound source has to be compensated as well. The gain in flexibility stems from the possibility to use anechoic or dry audio content and select the HRIR according to the current source and head position of the listener. While HRIRs are rarely used directly because anechoic listening conditions are unrealistic for most applications, they are essential for room acoustic simulations [80]. Acoustic simulations can be used to obtain binaural room impulse responses (BRIRs) that describe the sound propagation between a sound source in a reverberant environment and the listener’s ears. BRIRs can also be measured, thereby increasing the degree of realism at the cost of having to measure BRIRs for multiple positions and orientations of the listener if listener movements are to be supported during playback.

3.2.2 Head Tracking

Tracking the head position of the listener is required for dynamic binaural reproduction, i.e., a reproduction that accounts for movements of the listener by providing binaural signals according to the angle and distance between the source and the listener’s head. While it will be sufficient for some applications to only track the head orientation, the general VR/AR case requires six degrees of freedom (6DoF, i.e., translation and rotation in x, y, and z).

In general, two tracking approaches exist. Relative tracking systems track the position of the listener with respect to a potentially unknown starting point, while absolute tracking systems establish a world coordinate system within which the absolute position of the listener is tracked. Relative systems usually use inertial measurement units (IMU) to derive the listener position from combined sensing of a gyroscope, an accelerometer, and possibly a magnetometer. Absolute systems can use optical tracking by deriving the listener position from images of a single or multiple (infrared) cameras, or GPS data.

Artifact-free rendering requires a tracking precision of \(1^\circ \) and 1 cm [32, 46], and a total system latency of about 50 ms [45]. Note that a significantly lower latency of about 15 ms is required for rendering visual stimuli in AR applications [39]. A challenge for relative tracking systems is to control long-term drift of the IMU unit, while visual occlusion is problematic for optical absolute tracking systems.

3.2.3 Reproduction with Headphones

Headphone reproduction requires a compensation of the headphone transfer function (HpTF) by means of an inverse filter to deliver the binaural signals to the listener’s ears without introducing additional coloration. However, the design of the inverse filter is not straightforward. Two aspects are problematic. First, the HpTF varies considerably across listeners and headphone models, which may require the use of listener- and model-specific compensation filters depending on the demands of the application. Second, the low-frequency response and the center frequency and depth of high-frequency notches in the HpTF strongly depend on the fit of the headphone and may change considerably if the listener re-positions the headphones (cf. Fig. 5.13). To account for this variance, the average HpTF can be used to design the inverse filter, and the filter gain at low and high frequencies can be restricted using regularized inversion [24, 46]. Once calculated, the static headphone filter can be applied to the binaural signals by means of convolution.
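A minimal sketch of such a regularized inversion is given below: the average headphone impulse response is inverted in the frequency domain, and a frequency-dependent regularization limits the filter gain below roughly 50 Hz and above roughly 16 kHz. The corner frequencies and regularization values are illustrative assumptions, not values prescribed by the cited studies.

```python
# Minimal sketch of a regularized headphone compensation filter:
# the average HpTF is inverted in the frequency domain with a
# frequency-dependent regularization. Parameter values are illustrative.
import numpy as np

def regularized_inverse(hptf_avg, fs, beta_in=0.01, beta_out=1.0):
    """hptf_avg: average headphone impulse response (time domain)."""
    n = len(hptf_avg)
    H = np.fft.rfft(hptf_avg)
    f = np.fft.rfftfreq(n, 1 / fs)
    # strong regularization (limited gain) below 50 Hz and above 16 kHz
    beta = np.where((f > 50) & (f < 16e3), beta_in, beta_out)
    H_inv = np.conj(H) / (np.abs(H) ** 2 + beta)
    h_inv = np.fft.irfft(H_inv, n)
    return np.roll(h_inv, n // 2)   # make the filter causal (n/2 delay)

# the compensated response is then the convolution of hptf_avg and h_inv
```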

Fig. 5.13

Headphone transfer functions of subject 6 from the HUTUBS HRTF database for the left ear of a Sennheiser HD650 headphone [9]. Gray lines show the effect of re-positioning. Black lines show the averaged HpTF (CC-BY, Fabian Brinkmann)

In addition to this static convolution, a dynamic convolution is often required to render the current HRIR or BRIR. Since real-time audio processing works on blocks of audio, this is simply achieved by using the current HRIR as long as the listener does not move. If the listener moves, the past and current HRIR are both convolved simultaneously and a cross fade with the length of one audio block is applied between the two [82].
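A minimal sketch of this block-wise crossfade is given below for one ear; the block is rendered with both the previous and the current HRIR and the outputs are linearly crossfaded over one block. The handling of the filter tails is simplified (the old filter's tail is discarded after the fade), so this is an illustration of the principle rather than a production-ready convolution engine.

```python
# Minimal sketch of block-wise dynamic convolution with a one-block
# crossfade between the previous and the current HRIR (shown for one ear).
import numpy as np
from scipy.signal import fftconvolve

def render_block(block, hrir_old, hrir_new, state_old, state_new):
    """block: (B,) audio samples; hrir_*: (M,) impulse responses;
    state_*: (M-1,) overlap carried over from the previous block."""
    B = len(block)
    out_old = fftconvolve(block, hrir_old)
    out_new = fftconvolve(block, hrir_new)
    out_old[:len(state_old)] += state_old
    out_new[:len(state_new)] += state_new
    fade = np.linspace(0.0, 1.0, B)                   # one-block crossfade
    mixed = (1 - fade) * out_old[:B] + fade * out_new[:B]
    # simplification: the old filter's tail is dropped after the fade
    return mixed, out_new[B:]                         # output and new state
```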

3.2.4 Reproduction with Loudspeakers

While delivering binaural signals through headphones is the most obvious solution due to the one-to-one correspondence between the two ears and two speakers of the headphone, two approaches for transaural reproduction using loudspeakers are also available.

The first approach uses only two loudspeakers. In analogy to headphone reproduction, there is a one-to-one correspondence between the ear signals and speakers, and the filter for the left loudspeaker compensates for the transfer function between the speaker and the left ear. In contrast to headphone reproduction, however, this requires an additional filter for cross-talk cancellation (CTC) between the right speaker and the left ear (the filters for the right ear work accordingly). This requires an iterative design of the compensation filters for all possible positions of the head with respect to the loudspeakers and thus a dynamic convolution already for the compensation filters [51]. Optionally, more loudspeakers can be used to optimize the system for different listening positions or frequency ranges.
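The following sketch illustrates the underlying filter computation for a single, static head position: per frequency bin, the 2x2 matrix of loudspeaker-to-ear transfer functions is inverted with a regularized pseudo-inverse. In a dynamic CTC system these filters would be recomputed (or selected from a precomputed set) according to the tracked head position; the regularization value is an illustrative assumption.

```python
# Minimal sketch of static crosstalk cancellation filters for a
# two-loudspeaker setup: per frequency bin, the 2x2 matrix of
# loudspeaker-to-ear transfer functions is inverted with regularization.
import numpy as np

def ctc_filters(hrtf, beta=0.005):
    """hrtf: (F, 2, 2) complex, hrtf[f, ear, speaker].
    Returns C with shape (F, 2, 2) mapping binaural spectra to
    loudspeaker spectra: x_speakers = C @ x_ears per bin."""
    H_h = np.conj(np.transpose(hrtf, (0, 2, 1)))       # Hermitian transpose
    eye = np.eye(2)
    # regularized pseudo-inverse: (H^H H + beta I)^-1 H^H
    C = np.linalg.solve(H_h @ hrtf + beta * eye, H_h)
    return C
```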

The second approach uses linear or circular loudspeaker arrays. Here, the idea is to steer two narrow audio beams towards the listener’s ears. Because the beams concentrate most of their energy towards the listener’s ears, a high separation between the left and right ear beams can be achieved depending on the array geometry [20]. A one-to-one correspondence is thus established between the two beams and the ears, and cross-talk compensation is not required if the beams are sufficiently narrow. Here too, a dynamic convolution is required to update the beamformers according to the listener’s position.

3.3 Binaural Reproduction of Synthesized Sound Fields

It is worth noting that SFS approaches can be combined with binaural reproduction, either by virtualizing the loudspeaker array with an array of HRIRs or through binaural processing stages that build upon the sound field analysis (cf. [2], [64, Sect. 6.4.2] and [89, Sect. 4.11]). This makes binaural reproduction the prime framework for rendering spatial audio in AR/VR and SFS a versatile tool within that framework: First, SFS makes it possible to efficiently render binaural signals for arbitrary head orientations from a single SRIR (which might require pre-processing to achieve a reasonable quality, as detailed in Sect. 5.4.3). Second, SFS makes it possible to include listener movements (translation)—to a limited extent—and thus enables rendering with 6DoF. The realization of 6DoF rendering depends on the sound field representation, which strongly differs across SFS approaches. However, the general idea is similar in many cases. Head rotations can be realized by an inverse rotation of the sound field. For perceptually motivated SFS methods, translation can be realized by manipulating the directions and times of arrival that were obtained through SFA according to the listener’s movements (e.g., [41]). The possibility of realizing translation with physically motivated SFS approaches and measured sound fields is, however, rather limited, as this would require arrays with hundreds if not thousands of microphones.

4 System Performance

This section details the quality that can be achieved with the different reproduction paradigms, starting with binaural synthesis. This is the most common approach, and if it is used in combination with SFS, it also limits the maximum achievable quality of the SFS.

4.1 Binaural Synthesis

The authenticity and plausibility of a reproduction system are without a doubt the most integral and comprehensive quality measures and are thus discussed first. However, it is also important to shed light on the relevance of individual components in the reproduction pipeline. While there are many small pieces that contribute to the overall quality, the most relevant might be the individualization of binaural signals, head tracking, and audiovisual stimulation, which are discussed separately.

Fig. 5.14

Results of the test for authenticity. Top: Range of differences between the sound field of the real and virtual frontal loudspeakers across head-above-torso orientations. Data was measured at the blocked ear canal entrance and is shown as 12th (light blue) and 3rd octave (dark blue) smoothed magnitude spectra. Bottom: 2-alternative forced choice detection rates for all participants, two audio contents, source positions in front (\(0^\circ \)) and to the left (\(90^\circ \)), and three different acoustical environments (cf. Fig. 5.3). The size of the dots and the numbers next to them indicate how many participants scored identical results. Results on or above the dashed line are significantly above chance, indicating that differences between simulated and real sound fields were reliably audible; 50% correct answers denotes guessing (CC-BY, Fabian Brinkmann)

4.1.1 Authenticity and Plausibility

Headphone-based individual dynamic binaural synthesis can be authentic if reverberant environments and real-life signals, such as speech, are simulated. For this typical use case, 66% of the subjects in Brinkmann et al. [8] could not hear any differences between a real loudspeaker and its binaural simulation (cf. Fig. 5.14, bottom). However, differences such as coloration become audible when simulating anechoic environments or using artificial noise signals. Remaining differences stem from accumulated measurement errors in the range of 1 dB, mostly related to the positioning of the subject and the in-ear microphones during the experiment (cf. Fig. 5.14, top). Clearly, these differences can be detected more easily with steady broadband signals such as noise. The effect of reverberation might be twofold. First, the reverberation might be able to mask audible coloration in the direct sound, and second, reverberant parts of the BRIR might be less prone to coloration artifacts because measurement errors could cancel across reflections arriving from multiple directions.

Loudspeaker-based individual binaural synthesis by means of CTC can be authentic in anechoic reproduction rooms [59]. However, the quality drastically decreases if the CTC system is set up in reverberant environments, thus limiting the usability of this approach. The decrease in quality is caused by undesired reflections from the reproduction room that can not be compensated in practice due to uncertainties in the exact position of the listener [69].

Non-individual dynamic binaural synthesis is not authentic but can be plausible, i.e., it matches the listener’s expectation towards the acoustic environment. This means that differences between a real sound field and a non-individual simulation are audible in a direct comparison, but they are not large enough for the simulation to be detected as such in an indirect comparison. Although plausibility was only shown for headphone-based reproduction of reverberant environments [47, 63], it is reasonable to assume that this also holds for the simulation of anechoic environments and for loudspeaker-based reproduction in anechoic environments. Remaining differences between real sound fields and binaural simulations are discussed in the following section.

An example setup for testing authenticity and plausibility is shown in Fig. 5.3. It is important to note that authentic simulations can only be achieved under carefully controlled laboratory conditions. Otherwise, the placement of the headphones alone will already introduce audible artifacts that would be hard to control in any consumer application [61]. It can, however, be assumed that such artifacts are irrelevant for the vast majority of VR/AR applications, where plausibility is a sufficient quality criterion.

4.1.2 Effect of Individualization

Binaural signals (binaural recordings, HRIRs, BRIRs) are highly individual, i.e., they differ across listeners due to different shapes of the listeners’ ears, heads, and bodies. As a consequence, listening to non-individual binaural signals decreases the audio quality and can be thought of as listening through someone else’s ears. While the decrease in quality could already be seen in the integral measures authenticity and plausibility, this section will look at differences in more detail.

The most discussed degradation caused by non-individual signals is increased uncertainty in source localization [57]. Using individual head-related transfer functions (HRTFs, the frequency-domain HRIRs), median root mean square localization errors are approximately 27\(^\circ \) for the polar angle, which denotes the up/down source position, and 15\(^\circ \) for the lateral angle, which denotes the left/right position. Quadrant errors, which are a measure for front–back and up–down confusions (and mixtures thereof), occur in only 4% of the cases. A drastic increase of the quadrant error by a factor of 5 to about 20% and of the polar error by a factor of 1.5 to about 40\(^\circ \) can be observed if using non-individual signals. Because source localization in the polar dimension relies on high-frequency cues in the binaural signal, the increased errors can be attributed to differences in ear shapes, which have the strongest influence on binaural signals at high frequencies. The lateral error increases by only 2\(^\circ \). In this case, the auditory system exploits interaural cues (ITD, ILD) for localization, which stem from the overall head shape. The fact that head shapes differ less between listeners than ear shapes explains the relatively small changes in this case.

Fig. 5.15
figure 15

Perceived differences between a real sound field and the individual (blue, left) and non-individual (red, right) dynamic binaural simulation thereof. Results are pooled across an anechoic, dry, and wet acoustic environment. The horizontal lines show the medians, the boxes the interquartile ranges, and the vertical lines the minimum and maximum perceived differences. Scale labels were omitted for clarity and can be found in [48] (CC-BY, Fabian Brinkmann)

Whereas localization might be one of the most important properties of audio in virtual acoustic realities, it is by far not the only aspect that degrades due to non-individual signals. An extensive qualitative analysis is shown in Fig. 5.15. The results were obtained with pulsed pink noise as audio content in a direct comparison between a frontal loudspeaker and headphone-based dynamic binaural syntheses thereof, using the setup shown in Fig. 5.3. Apart from qualities related to the scene geometry (localization, externalization, etc.), considerable degradations can also be observed for aspects related to the tone color. In sum, this also led to a larger overall difference, and subjects rated the non-individual simulation to be less natural and clear than its individual counterpart. As a result, the individual simulation was generally preferred (attribute liking); the sense of presence, however, was not affected. Because the similarity between the individual BRIRs and the non-individual BRIRs used in the test depends on the listener, the results for non-individual synthesis show a considerably higher variance (indicated by the interquartile ranges).

Differences for individual binaural synthesis are small compared to non-individual synthesis. In this case, noteworthy differences only remain for the tone color. These differences stem from measurement uncertainties that arise mostly due to positioning inaccuracies of the subjects and in-ear microphones. As mentioned above, these differences become inaudible if using speech signals instead of pulsed noise.

Individualization is not only important for HRIRs and BRIRs but also for the headphone compensation (HpC). The examples above either used fully individual (individual HRIRs/BRIRs and HpC) or fully non-individual (non-individual HRIRs/BRIRs and HpC) simulations. Combinations of these cases were investigated by Engel et al. [15] and Gupta et al. [26]. As expected, fully individual simulations always have the highest quality, and considerable degradations can be observed if using individual signals with a non-individual HpC. If an individual HpC is not feasible, differences between individual and non-individual signals were only significant for the source direction but not for the perceived distance, coloration, and overall similarity. In any case, at least a non-individual HpC should be used because differences are the largest for simulations without HpC.
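The HpC itself is typically obtained by a regularized inversion of the measured headphone transfer function. The following minimal sketch illustrates the idea; the function name, the regularization constant, and the simple frequency-independent regularization are assumptions made for this example, and practical designs additionally smooth, band-limit, and phase-process the result.

```python
import numpy as np

def headphone_compensation(hptf_ir, beta=0.01):
    """Regularized inversion of a measured headphone impulse response.
    beta limits the gain at deep notches of the headphone transfer
    function, which would otherwise be boosted excessively."""
    H = np.fft.rfft(hptf_ir)
    C = np.conj(H) / (np.abs(H) ** 2 + beta)   # regularized inverse spectrum
    # Compensation filter, to be convolved with the binaural signals
    return np.fft.irfft(C, n=len(hptf_ir))
```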

Many individualization approaches are available that mitigate the detrimental effects of non-individual signals to a certain degree [25]. However, they demand additional action from the listener to obtain individual or individualized signals. It is thus worth noting—and discussed in the next sections—that head tracking and visual stimulation are two means to mitigate some effects that do not require actions from the listener.

4.1.3 Effect of Head Tracking

Without head tracking, the auditory scene will move if the listeners move their head, which is a very unnatural behavior for most VR/AR applications. Head-tracked dynamic simulations in which the auditory scene remains stable during head movements have thus become the standard. Besides the general improvement of the sense of presence and immersion, this has at least two more benefits.

First, localization errors for non-individual signals decrease if head tracking is enabled [52]. While the lateral localization errors remain largely unaffected, front–back confusions completely disappear if the listeners rotate their head by 32\(^\circ \) or more to the left or right. This can be explained by movement-induced dynamic changes in the binaural signals. As listeners rotate their head to the left, the left ear moves away from a source in the front while the right ear moves towards it. Because this behavior is exactly reversed for a source behind the listener, the auditory system can resolve the front–back confusion through the head motion. Up–down confusions can be resolved analogously by nodding the head to the left or right. Additionally, the elevation error decreases by a third for head rotations of 64\(^\circ \) to the left or right. This can be explained by the fact that dynamic changes in the binaural signals are largest for a frontal source and almost disappear for a source above or below the listener.
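This reasoning can be made explicit with the classic Woodworth spherical-head approximation of the ITD. In the sketch below, the head radius, the sign convention (positive azimuth to the left, positive ITD meaning the left ear leads), and the function name are assumptions for illustration; the point is only that the same head rotation changes the ITD in opposite directions for a frontal and a rear source.

```python
import numpy as np

def itd_woodworth(azimuth_deg, head_radius=0.0875, c=343.0):
    """Woodworth ITD approximation for a rigid spherical head.
    Azimuth is measured from the front, positive to the left;
    a positive ITD means the sound reaches the left ear first."""
    az = np.radians((azimuth_deg + 180) % 360 - 180)   # wrap to (-180, 180]
    if abs(az) > np.pi / 2:                            # rear sources mirror to the front
        az = np.sign(az) * (np.pi - abs(az))
    return head_radius / c * (az + np.sin(az))

# Front (0 deg) and rear (180 deg) sources both yield zero ITD -> confusable.
# After the listener rotates the head 30 deg to the left, the head-relative
# azimuths become -30 deg (frontal source) and 150 deg (rear source):
print(itd_woodworth(-30))   # negative ITD: right ear leads
print(itd_woodworth(150))   # positive ITD: left ear leads
```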

The second benefit pertains to the externalization of non-individual virtual sources [31]. While sources to the side are well externalized even with non-individual signals, sources to the front and rear were often reported to be perceived as being inside the head. The most likely reason for this is that signals for sources close to the median plane are similar for the left and right ears. In contrast, the ear signals differ in time and level for sources to the side. These differences stem from the spatial separation of the ears and the acoustic shadow of the head and might provide the auditory system with evidence of the presence of an external source. If listeners perform large head rotations to the left and right, dynamic binaural cues are induced and the externalization of frontal and rear sources significantly increases.

Despite the positive effects of head tracking, it has to be kept in mind that listeners will not always perform large head movements just because they can. The actual benefit might thus often be smaller than reported above. However, dynamic cues that are similar to those of head movements can also be induced by a moving source, which was shown to have a similarly positive effect on externalization [30]. An effect of source movements for localization has not yet been extensively investigated. For the case of distance localization, it was already shown that active self-motion is more efficient than passive self-motion and source motion [22].

4.1.4 Effect of Visual Stimulation

Because VR/AR applications usually provide congruent audiovisual signals, it is worth considering the effect of visual stimulation on the audio quality. Interestingly—and in contrast to head tracking—visual stimulation can have positive as well as negative effects.

The possibly most important positive aspect is the ventriloquism effect, which describes the phenomenon that a fused audiovisual event is perceived at the location of the visual stimulus even if the position of the auditory event deviates from that of the visual event. Median thresholds below which fusion occurs are approximately 15\(^\circ \) in the horizontal plane and 45\(^\circ \) in the median plane if a realistic stereoscopic 3D video of a talker is presented [29]. Comparing this to the localization errors reported in Sect. 5.4.1.2, it can be hypothesized that localization errors will drastically decrease, if not completely disappear, even for non-individual binaural synthesis due to audiovisual fusion and the ventriloquism effect if a source is visible and within the field of view. It has to be kept in mind, however, that the degree of realism of the visual stimulation—termed compellingness in [29]—affects the strength of the ventriloquism effect. Thus, fusion thresholds can decrease for less realistic visual stimulation.

Quality-degrading effects can occur if the (expected) acoustics of the visually presented room do not match the acoustics of the auditorily presented room—an effect termed room divergence. This effect is especially relevant for AR applications, where listeners can naturally explore real audiovisual environments to which artificial auditory or audiovisual events are added. However, room divergence can also appear in VR applications, for example, due to poorly parameterized room acoustic simulations. Room divergence has not been extensively researched to date, but it was already shown that it can affect distance perception and externalization [23, 83]. While degradations with respect to these qualities might be mitigated by the ventriloquism effect [56], room divergence might also affect higher level qualities such as plausibility and presence.

4.2 Sound Field Synthesis

The discussion of SFA/SFS is limited to perceptually motivated approaches because they are predominantly used in VR/AR applications. In-depth evaluations of physically motivated approaches were, for example, conducted by Wierstorf [85] and Erbes [17].

4.2.1 Vector-Based and Ambisonics Panning

The most important quality factor for loudspeaker-based reproduction approaches is the number of loudspeakers L. In the case of Ambisonics, there is a strict dependency between L and the achievable spatial resolution, which is determined by the so-called Ambisonics order N: a full 3D reproduction of order N requires \(L \gtrsim (N+1)^2\) loudspeakers. Intuitively, the spatial resolution increases with increasing Ambisonics order. For the amplitude panning methods, the fluctuation of the perceived source width across source positions (VBAP) and the minimally achievable source width that is independent of the source position (MDAP) decrease with increasing L.
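The following helper functions merely restate this rule of thumb in code, assuming full 3D (periphonic) reproduction; the function names are made up for this illustration.

```python
import math

def min_loudspeakers(order):
    """Minimum number of loudspeakers commonly recommended for a full 3D
    Ambisonics reproduction of order N, i.e., (N + 1)**2."""
    return (order + 1) ** 2

def max_order(num_loudspeakers):
    """Highest order whose (N + 1)**2 spherical harmonics the given
    number of loudspeakers can still resolve."""
    return math.isqrt(num_loudspeakers) - 1

print(min_loudspeakers(2))   # -> 9
print(max_order(16))         # -> 3
```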

Both approaches—vector-based and Ambisonics panning—have distinct disadvantages at very low orders \(N \lesssim 2\), i.e., for arrays consisting of only about four to nine loudspeakers. In this case, Ambisonics and MDAP have a rather limited spatial resolution and Ambisonics additionally exhibits a dull sound color. For VBAP, on the other hand, the source width heavily depends on the position of the virtual source. Using state-of-the-art Ambisonics decoders, the differences between the approaches decrease at orders \(N \gtrsim 3\), i.e., for arrays consisting of 16 loudspeakers or more. For such arrays, all methods are able to produce virtual sources whose width and loudness are independent of the source position. For an in-depth discussion of these properties the interested reader is referred to Zotter and Frank [89, Chaps. 1 and 3] and Pulkki et al. [64, Chap. 5].
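For reference, the sketch below shows the basic pairwise VBAP principle for a single horizontal loudspeaker pair; the chosen loudspeaker azimuths (a standard stereo pair at +-30 degrees) and the constant-power normalization are assumptions for this example. The dependence of the source width on the panning direction, mentioned above, follows directly from how many loudspeakers carry significant gain.

```python
import numpy as np

def _unit(az_deg):
    """Unit vector for an azimuth in degrees, counterclockwise from the front."""
    return np.array([np.cos(np.radians(az_deg)), np.sin(np.radians(az_deg))])

def vbap_2d_gains(source_az_deg, spk_az_deg=(30.0, -30.0)):
    """Pairwise 2D vector-base amplitude panning for one loudspeaker pair."""
    base = np.column_stack([_unit(a) for a in spk_az_deg])  # loudspeaker unit vectors
    gains = np.linalg.solve(base, _unit(source_az_deg))     # source = base @ gains
    return gains / np.linalg.norm(gains)                    # constant-power normalization

print(vbap_2d_gains(0.0))    # phantom center: both loudspeakers equally active
print(vbap_2d_gains(30.0))   # source on a loudspeaker: only that loudspeaker active
```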

4.2.2 SIRR, SDM, and DirAC

Different versions of SIRR and DirAC have been proposed over the past years. The two most advanced versions are the so-called Virtual Microphone DirAC, which improved the rendering of diffuse sound field components over the original DirAC version, and higher order DirAC/SIRR, which make it possible to estimate more than one directional component for each time frame to improve the rendering of challenging acoustic scenes [53, 64, Chaps. 5 and 6]. For an array consisting of 16 loudspeakers that are set up in acoustically treated environments (anechoic or very dry), SIRR and DirAC can achieve a high audio quality of about 80–90% on a MUSHRA-like rating scale (cf. Sect. 5.2.1.1). Best results are obtained for idealized microphone array signals, i.e., if the SIRR/DirAC input signals are synthetically generated instead of recorded with a real microphone. Using a real microphone array decreased the audio quality by about 10% on average.
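To indicate what these parametric methods actually estimate, the following sketch computes the two core DirAC parameters, direction of arrival and diffuseness, per time-frequency bin from first-order (B-format) STFT signals. The normalization constants follow one common convention and differ between publications and implementations, so this is an illustrative approximation rather than a reference implementation.

```python
import numpy as np

def dirac_analysis(W, X, Y, Z):
    """Per-bin DirAC parameters from B-format STFT bins (complex arrays).

    Assumes the common convention in which W carries a 1/sqrt(2) gain.
    Returns direction-of-arrival unit vectors and the diffuseness
    (0 for a single plane wave, approaching 1 for a diffuse field).
    In practice, intensity and energy are averaged over a short time
    window before the diffuseness is computed.
    """
    u = np.stack([X, Y, Z])                    # particle-velocity proxy
    intensity = np.real(np.conj(W) * u)        # active intensity (up to a constant)
    energy = np.abs(W) ** 2 + 0.5 * np.sum(np.abs(u) ** 2, axis=0)
    norm_i = np.linalg.norm(intensity, axis=0)
    # The sign of the DOA depends on the velocity/B-format sign convention
    doa = intensity / (norm_i + 1e-12)
    diffuseness = 1.0 - np.sqrt(2.0) * norm_i / (energy + 1e-12)
    return doa, diffuseness
```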

Similar audio qualities were obtained for SDM [78] and binaural SDM [2]. The latter study showed that binaural SDM achieves a plausibility score similar to that of sound fields emitted by real loudspeakers. Although this plausibility score differs from the definition of plausibility in Sect. 5.2.2.2, it is reasonable to assume that SDM—and also SIRR and DirAC—can be plausible, but not authentic.

So far, perceptual evaluations were conducted in acoustically treated listening rooms, and it is reasonable to expect that the quality decreases with an increasing degree of reverberation in the listening environment. Moreover, a comprehensive comparative evaluation of SIRR and SDM is missing to date, and existing studies sometimes used test conditions that might have favored one approach over the others.

SIRR, SDM, and DirAC might be the most common, but are by far not the only methods for perceptually motivated SFS. Broader overviews are, for example, given by Pulkki et al. [64, Chap. 4] and Zotter and Frank [89, Sect. 5.8].

4.3 Binaural Reproduction of Synthesized Sound Fields

As mentioned before, SFS approaches can be reproduced via headphones by virtualizing the loudspeaker array with a set of HRTFs. The virtualization is uncritical if the number of virtual loudspeakers can be freely selected, which is often the case for SIRR, SDM, and DirAC. The situation is more difficult, however, for Ambisonics signals, which are typically order-limited to \(1\lesssim N \lesssim 7\). The challenge in this case is to derive an Ambisonics version of the HRTF data set with the same order restriction. Without specifically tailored algorithms, an order of \(N \approx 35\) is required for an authentic Ambisonics representation of HRTFs, and simply restricting the order causes clearly audible artifacts [3].

Fig. 5.16 Perceived differences between a reference and order-limited binaural renderings of microphone array recordings. For details refer to [49] (CC-BY, Tim Lübeck)

A variety of methods have been proposed to mitigate these artifacts. These comprise a global spectral equalization with or without windowing (tapering) of the spherical harmonics coefficients, or a separate treatment of the HRTF phase by means of (frequency-dependent) time alignment or by finding an optimal phase that reduces errors in the HRTF magnitude [3, 89, Sect. 4.11]. A comparative study of these algorithms was conducted by Lübeck et al. [49]. As shown in Fig. 5.16, the differences between a reference and the binaural renderings are already small for \(N=3\), at least for the best algorithms.
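As an example of one of these mitigation strategies, the sketch below applies per-order tapering weights to SH-domain HRTF coefficients stored in ACN channel ordering. The half-sided cosine fade over the two highest orders is one simple choice and the function names are made up here; the tapering windows and the accompanying spectral equalization used in the literature cited above differ in detail.

```python
import numpy as np

def order_taper(N, n_fade=2):
    """Per-order weights: unity up to order N - n_fade, then a half-sided
    cosine fade that attenuates the highest n_fade orders."""
    w = np.ones(N + 1)
    fade = 0.5 * (1.0 + np.cos(np.linspace(0.0, np.pi, n_fade + 2)[1:-1]))
    w[N - n_fade + 1:] = fade
    return w

def apply_order_taper(hrtf_sh, weights):
    """Apply per-order weights to SH coefficients of shape (..., (N + 1)**2),
    where order n occupies the ACN channels n**2 to (n + 1)**2 - 1."""
    N = len(weights) - 1
    per_channel = np.concatenate([np.full(2 * n + 1, weights[n]) for n in range(N + 1)])
    return hrtf_sh * per_channel

print(order_taper(3))   # -> [1.  1.  0.75 0.25]
```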

Another benefit of headphone reproduction is that different reproduction techniques can be combined to fine-tune the trade-off between perceptual quality and computational efficiency. One possible solution is to use HRTFs with a high spatial resolution for rendering the direct sound (high computational cost, high quality) combined with Ambisonics-based rendering of the reverberant components (cost and quality adjustable by means of the SH order) [16]. This exploits the fact that the spatial resolution of the auditory system is higher for the direct sound than for reverberant components [50].
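A minimal sketch of such a hybrid renderer is given below. All variable names, array shapes, and the simple sum-of-convolutions Ambisonics-to-binaural decode are assumptions made for this illustration; real implementations typically work block-wise and interpolate HRIRs during head movements.

```python
import numpy as np
from scipy.signal import fftconvolve

def hybrid_binaural(direct_signal, hrir, reverb_ambi, sh_binaural_filters):
    """Hybrid rendering: direct sound via a high-resolution HRIR,
    reverberation via a low-order Ambisonics-to-binaural decode.

    direct_signal       -- mono direct-sound signal, shape (n,)
    hrir                -- HRIR for the current source direction, shape (2, m)
    reverb_ambi         -- reverberant Ambisonics signals, shape ((N + 1)**2, n)
    sh_binaural_filters -- SH-to-binaural filters, shape (2, (N + 1)**2, k)
    """
    direct = np.stack([fftconvolve(direct_signal, hrir[ear]) for ear in range(2)])
    reverb = np.stack([
        sum(fftconvolve(reverb_ambi[ch], sh_binaural_filters[ear, ch])
            for ch in range(reverb_ambi.shape[0]))
        for ear in range(2)])
    out = np.zeros((2, max(direct.shape[1], reverb.shape[1])))
    out[:, :direct.shape[1]] += direct
    out[:, :reverb.shape[1]] += reverb
    return out   # binaural output, shape (2, samples)
```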

5 Conclusion

Section 5.2 gave an overview of existing quality measures for evaluating 3D audio content, and it became apparent that the underlying concepts can also be used to assess audio quality in audiovisual virtual reality. Good suggestions were made to adapt the application of these measures to AR/VR by simplifying the associated rating interfaces and/or adapting the methods for statistical analysis. Open questions in this field mainly relate to the higher level constructs of QoE and presence. It will be interesting to see how these can be measured with less intrusive user interfaces or—in the best case—with indirect physiological or psychological measures. If such methods were established, it would also be possible to further investigate to what extent these higher level constructs are affected by specific aspects of audio quality.

Sections 5.3 and 5.4 introduced selected approaches for generating 3D audio for AR/VR and reviewed their quality. The current best practice of using non-individual binaural synthesis with compensated headphones for audio reproduction can generate plausible simulations and can significantly benefit from additional information provided by 3D visual content. Recent advances in signal processing fostered the combination of SFS and binaural reproduction. This improved the efficiency—a key factor for enabling 3D audio rendering in mobile applications—without introducing significant quality degradations. One current hot topic in the combination of SFS and binaural reproduction is clearly 6DoF rendering. Many algorithms have been suggested for this; however, their development and, even more so, their perceptual evaluation are still ongoing in the majority of cases. The interested reader may have a look at recent articles as a starting point for discovering this field (e.g., [4, 41]). A second hot topic is the individualization of binaural technology. The effects of individualization were discussed, and it was shown that individualization makes it possible to create simulations that are perceptually identical to a real sound field. Approaches for individualization were, however, not detailed, and the interested reader is referred to the overview of Guezenoc and Renaud [25].

From the user perspective, it is worth noting that an increasing pool of software and hardware is available for 3D audio reproduction. State-of-the-art audio processing and reproduction methods are available as plug-ins that can easily be integrated into the production workflow, as well as in toolboxes that can be used for further research and product development. This is complemented by VR/AR-ready hardware such as microphone arrays as well as head-mounted displays and headphones with built-in head trackers.