Crossmodal hierarchical predictive coding for audiovisual sequences in the human brain

Huang, Yiyuan Teresa; Wu, Chien-Te; Fang, Yi-Xin Miranda; Fu, Chin-Kun; Koike, Shinsuke; Chao, Zenas C.

doi:10.1038/s42003-024-06677-6

Crossmodal hierarchical predictive coding for audiovisual sequences in the human brain

Article
Open access
Published: 09 August 2024

Volume 7, article number 965, (2024)
Cite this article

Download PDF

You have full access to this open access article

Communications Biology

Crossmodal hierarchical predictive coding for audiovisual sequences in the human brain

Download PDF

465 Accesses
2 Altmetric
Explore all metrics

Abstract

Predictive coding theory suggests the brain anticipates sensory information using prior knowledge. While this theory has been extensively researched within individual sensory modalities, evidence for predictive processing across sensory modalities is limited. Here, we examine how crossmodal knowledge is represented and learned in the brain, by identifying the hierarchical networks underlying crossmodal predictions when information of one sensory modality leads to a prediction in another modality. We record electroencephalogram (EEG) during a crossmodal audiovisual local-global oddball paradigm, in which the predictability of transitions between tones and images are manipulated at both the stimulus and sequence levels. To dissect the complex predictive signals in our EEG data, we employed a model-fitting approach to untangle neural interactions across modalities and hierarchies. The model-fitting result demonstrates that audiovisual integration occurs at both the levels of individual stimulus interactions and multi-stimulus sequences. Furthermore, we identify the spatio-spectro-temporal signatures of prediction-error signals across hierarchies and modalities, and reveal that auditory and visual prediction errors are rapidly redirected to the central-parietal electrodes during learning through alpha-band interactions. Our study suggests a crossmodal predictive coding mechanism where unimodal predictions are processed by distributed brain networks to form crossmodal knowledge.

A quantitative model reveals a frequency ordering of prediction and prediction-error signals in the human brain

Article Open access 10 October 2022

Spatiotemporal brain hierarchies of auditory memory recognition and predictive coding

Article Open access 21 May 2024

Perceptual learning shapes multisensory causal inference via two distinct mechanisms

Article Open access 19 April 2016

Introduction

Predictive-coding theory postulates that the brain is constantly predicting upcoming events based on an internal model established by learning the statistical regularity in sensory experience^1,2. This is achieved by a bidirectional and hierarchical signaling cascade, where the top-down prediction represents the information inferred from prior knowledge to explain away the input, and the bottom-up prediction error is generated to refine the prediction when there is an unexplained discrepancy. This coding is thought to be the basis for diverse cognitive processes, such as perceptual inference, reinforcement learning, and motivational attention^3,4,5.

Considerable evidence of predictive coding has been provided within single sensory modalities. For example, the mismatch negativity (MMN), a surprise response to oddball stimuli, has been interpreted as a prediction-error signal in the domains of audition^6,7,8, vision^9,10 and tactility^11,12. The hierarchical organization of the unimodal predictive coding has also been examined, for example, by the local-global oddball paradigm with auditory¹³ and visual stimuli¹⁴. In this paradigm, regularities were controlled at the stimulus-to-stimulus level (local regularity) and the multi-stimulus sequence level (global regularity). The hierarchical prediction-error signals have been identified in both monkeys and humans^15,16,17, and hierarchical prediction signals have been extracted with a quantitative predictive-coding model¹⁸.

To further generalize the predictive-coding theory, it is critical to verify how predictive coding can operate across multiple sensory modalities, particularly, how prior knowledge can be transferred across sensory domains. During audiovisual events, it has been found that visual information can facilitate the early auditory response, suggesting a prediction established by the crossmodal transition from the visual to auditory domains and thus speeding up the process of upcoming auditory input^19,20. Furthermore, the concept of a crossmodal prediction error comes into play once a crossmodal transition is violated. For instance, a visual stimulus V followed by an auditory stimulus A is learned, and a crossmodal prediction error is generated when this expected transition is disrupted by replacing A with a different auditory stimulus A’. This error, which arises when visual information is inadequate to infer upcoming auditory stimuli, serves to explain phenomena such as the McGurk fusion and combination of audiovisual speech^19,21,22. However, it remains unclear how crossmodal transitions are predicted at different hierarchical levels, and how sensory information from different modalities converges along the hierarchy that leads to the emergence of crossmodal prediction errors.

To address this, we use a crossmodal audiovisual local-global paradigm in which auditory and visual stimuli are delivered one at a time in sequences, and the unimodal and crossmodal transitions are controlled at two hierarchical levels of temporal regularities. By delivering stimuli one at a time, we can independently control the occurrences of both unimodal and crossmodal transitions, which cannot be done if both auditory and visual stimuli are presented simultaneously. To examine how the brain learns and represents unimodal and crossmodal transitions at the two levels, we record electroencephalogram (EEG) and analyze the data with a model-fitting approach¹⁸. Three predictive-coding models are tested: one with auditory and visual processes not integrated, one with audiovisual integration at the global level, and one with audiovisual integration at both the local and global levels. The best-fitting model reveals that audiovisual integration occurs at both levels, indicating a convergence of the auditory and visual information to create crossmodal predictions, even at the lower level of stimulus-to-stimulus transitions. We further show that the auditory and visual prediction-error signals are rapidly re-routed from the electrodes around the contralateral temporal and parieto-occipital areas to the electrodes around the central-parietal area during learning via the alpha-band interactions. Our findings demonstrate that the principles of hierarchical predictive coding, initially understood in the context of unimodal processing, can also comprehensively describe the learning and encoding of multilevel statistics observed in crossmodal transitions, with the engagement of more widespread brain networks.

Results

A crossmodal audiovisual local-global oddball paradigm

We designed a crossmodal audiovisual local-global oddball paradigm in which crossmodal transitions were created between a tone (denoted as A) and a checkerboard (denoted as V). These two stimuli were arranged to create four sequences: AAA, AAV, VVV, and VVA. Within each sequence, all stimuli were presented for 200 msec with a 300 msec stimulus onset asynchrony. A random interval of 1600–1900 msec was inserted between the offset of one sequence and the onset of the next sequence (Fig. 1a). Four sequence blocks were created, each one with 120 trials of stimulation sequences but with distinct predictability in the crossmodal transitions. In each block, a frequent sequence (100 trials) and an infrequent sequence (20 trials) were assigned, with the first two stimuli in both sequences being of the same modality and the last being of different modalities (Fig. 1b). Each block was presented twice: once with both the visual and auditory stimuli delivered from the left side, and once with both stimuli delivered from the right side (Fig. 1a). The purpose of this lateral delivery was to enable post-hoc confirmation of the lateralization in sensory responses, such as whether EEG responses to an image in the left visual field would appear mostly at the electrodes around the right occipital areas (we called these “right occipital electrodes” based on the areas they covered). Participants were asked to pay attention to the stimuli while maintaining eye fixation on a white dot.

**Fig. 1: Crossmodal audiovidual local-global oddball paradigm.**

During the experiment, 64-channel EEG signals were recorded, and no task-related behavioral responses were required. The EEG data were preprocessed to reduce artifacts and impact of volume conduction (see details in Methods). For each trial, channel, and participant, we quantified the event-related spectral perturbation (ERSP) by calculating the time-frequency representation with baseline normalization (the decibel method; baseline: –0.2 to 0 s, relative to the first stimulus). In trials where stimulation was applied to the left side, the channel assignments for ERSP were mirrored from left to right. This mirroring aimed to align the sensory responses with trials where stimulation was delivered to the right side and ensured the left hemisphere to serve as the contralateral side in both scenarios. After the mirroring, the ERSPs were then averaged across trials for each trial type in each block, and for each participant and each channel.

Univariate analysis of neural responses for audiovisual integration

First, we used a univariate analysis to evaluate whether audiovisual integration occurred in our crossmodal paradigm. Importantly, we define audiovisual integration as the neural processes where auditory and visual stimuli influence each other^23,24, without necessarily converging into a singular higher-level concept, such as the audiovisual speech perception²⁵. Under this definition, if integration occurred, its degree would be different between Blocks 1 and 2 due to different numbers of A-to-V transitions. Therefore, the neural responses to AAA would differ in these two blocks. Similarly, the neural responses to VVV would differ in Blocks 3 and 4 due to different numbers of V-to-A transitions.

To evaluate this, we contrasted the averaged ERSPs: AAA in Block 1 – AAA in Block 2, and VVV in Block 3 – VVV in Block 4. We first identified significant contrast responses across participants for each channel by bootstrap analysis (n = 30 participants, 1000 resamples, alpha = 0.05, two-tailed, with false discovery rate correction). An example of the significant contrast response at channel Pz is shown in Fig. 2a (examples from other channels Fcz, C3 and Pz are shown in Supplementary Fig. 1). For each of the two contrasts, we calculated the overall significant contrast responses by averaging significant values across all channels, replacing non-significant ones with zeros. From the overall significant contrast responses (shown in Fig. 2b), we observed that the AAA contrast exhibited positive differences in the theta and gamma frequency bands, which were followed by negative differences in the alpha and beta bands. Conversely, the VVV contrast displayed only negative differences in the alpha and beta bands. Overall, significant negative alpha and beta power responses were identified in both contrasts.

**Fig. 2: Validation of audiovisual integration.**

To further evaluate significant differences in EEG topography, we averaged the contrast responses across time points (0–0.9 s from the last stimulus onset) and within specific frequency bands—theta (4–8 Hz), alpha (8–13 Hz), beta (13–30 Hz), and gamma (30–100 Hz)—for each contrast, channel, and participant. Then, we identified significant responses across participants for each channel by permutation analysis (n = 30 participants, 1000 permutations, alpha = 0.05, two-tailed, cluster-based correction). For the AAA contrast (Fig. 2c), significant differences primarily appeared in the alpha and beta bands at the parieto-occipital electrodes, indicating that a region typically associated with visual processing was active during the processing of purely auditory stimuli. Conversely, for the VVV contrast, significant differences were primarily observed in the alpha band at both the frontal and parieto-occipital electrodes, and in the beta band at the central and parieto-occipital electrodes. This implies that visual activations may move toward the frontal and central scalp distributions in the crossmodal context.

To further confirm the presence of audiovisual integration in the paradigm, we compared the responses to AAA from the crossmodal paradigm to those from a classical auditory local-global oddball paradigm (see details in Methods). The results, depicted in Supplementary Fig. 2, reveal a consistent disparity in the alpha band within the parieto-occipital region, similar to what is shown in Fig. 2c.

Three hypotheses of crossmodal transitions across hierarchy

After confirming the integration of audio and visual stimuli in our paradigm, we further investigate how the crossmodal knowledge can form across hierarchical levels. In our paradigm, the regularities of crossmodal transitions can be established at two hierarchical levels within a single block. The local regularity is established based on the stimulus-to-stimulus transition probability (TP), defined as the conditional probability of a particular stimulus occurring, given the preceding stimulus. We denoted the probabilities of the crossmodal A-to-V and V-to-A transitions as TP_A→V and TP_V→A, respectively. On the other hand, the probabilities of the unimodal A-to-A and V-to-V transitions were denoted as TP_A→A and TP_V→V, respectively. At a higher level, the global regularity is established based on the sequence probability (SP), defined as the probability of sequences AAA, AAV, VVV, or VVA within a block. These were denoted as SP_AAA, SP_AAV, SP_VVV, and SP_VVA, respectively.

To understand how the brain processes local and global regularities in the crossmodal sequences, we first need to know how it integrates auditory and visual information. For instance, if no audiovisual integration occurs during the processing of two consecutive stimuli (local integration), then the probabilities of A-to-V and V-to-A transitions cannot be determined, resulting in TP_A→V and TP_V→A being zero. Similarly, without audiovisual integration at the sequence level (global integration), AAV and VVA cannot be recognized as single sequences, and therefore, SP_AAV and SP_VVA are zero. Here, we present three hypotheses on how the brain integrates auditory and visual information at the local and global levels, which were subsequently verified using EEG data. To elucidate these hypotheses, we use the processing sequences of AAA and AAV as illustrative examples (Fig. 3a).

**Fig. 3: Hypothesis and Crossmodal hierarchical predictive coding models.**

Hypothesis 1: Auditory and visual information are processed separately, but instead are processed independently within their respective auditory and visual streams. Since the auditory and visual streams are independent from each other, the auditory stream perceives AAA and AAV as AAA and AA, respectively (Fig. 3a). In this scenario, TP_A→V equals zero, and TP is determined based on the A-to-A transition (TP_A→A) and A-to-O transition (TP_A→O), where O represents omissions at the end of the sequence. It is important to note that the A-to-O transition needs to be considered, since at the stimulus-to-stimulus transition level, the last A in a sequence will continue to predict the next stimulus item (including omission)¹⁸. This need can be illustrated by a simplified example comparing two sequences without a global structure: AAAAAAAA and AAAAOAAA. If omissions are not considered as events, TP will be the same for both sequences. However, the anticipation of the upcoming A tone clearly differs between these two scenarios because the omission is anticipated in the second sequence. Similarly, SP_AAV equals zero, and SP applies only to sequences AAA and AA (thus SP_AAA and SP_AA are nonzero). This hypothesis presents an unlikely scenario, and it serves as a benchmark for other hypotheses.

Hypothesis 2: Auditory and visual information are processed independently at the local level but are integrated at the global level. In this scenario, auditory and visual information are processed separately to determine TP, as in Hypothesis 1, but are processed jointly to determine SP. Consequently, AAA and AAV are considered as sequences (thus SP_AAA and SP_AAV are nonzero), and there is no sequence AA (SP_AA is equal to zero).

Hypothesis 3: Auditory and visual information are integrated at both the local and global levels. In this scenario, auditory and visual information are combined to determine both TP and SP. Consequently, the A-to-V transition is now considered, making TP_A→V nonzero. SP is determined in the same manner as in Hypothesis 2.

Therefore, the TP and SP values vary not only across the blocks but also across the three hypotheses, as detailed in Table 1.

Table 1 Transition and sequence probabilities across blocks and models

Full size table

Quantitative models for evaluating the hypotheses

In order to evaluate these hypotheses using EEG data, we need to quantify the interplay among sensory, prediction, and prediction-error signals within the varying configurations of audiovisual integration proposed in each hypothesis. To achieve this, we adapted a hierarchical predictive-coding model originally designed to elucidate the auditory local-global paradigm^15,18, and incorporated varying interactions between auditory and visual streams. Each of the three hypotheses was represented by a corresponding model, each featuring unique structural elements and signal flows (Fig. 3b).

All three models consist of two key structures: the auditory and visual sensory streams (A and V streams), and three hierarchical levels (Levels S, 1, and 2). Level S receives sensory input, and Levels 1 and 2 encode the local and global regularities, respectively. Using the A stream as an example, the neuronal population at Level S (denoted as A₀) receives an auditory input (denoted as S_A) and a top-down first-level prediction of that input (denoted as P1_A) from Level 1. When the stimulus is A, S_A = 1, and otherwise, S_A = 0. The first-level prediction error (denoted as PE1_A), which is the absolute difference between S_A and P1_A, is sent to Level 1. At Level 1, the neuronal population (denoted as A₁) receives PE1_A and a second-level prediction (denoted as P2_A) from Level 2. The second-level prediction error (denoted as PE2_A), which is the absolute difference between PE1_A and P2_A, is sent to the neural population at Level 2 (denoted as A₂). The V stream functions similarly, and there are first-level and second-level predictions and prediction errors (denoted as P1_V, P2_V, PE1_V, and PE2_V). S_V is set to 1 when the stimulus is V and 0 otherwise.

The key distinction among the three models is whether auditory and visual streams are integrated at Level 1 (local level) and Level 2 (global level) (Fig. 3b):

Model 1: To model Hypothesis 1, which proposes no audiovisual integration, this model features discrete auditory and visual streams, and the auditory and visual signals operate independently.

Model 2: To model Hypothesis 2, which proposes audiovisual integration at the global level, this model incorporates an integration layer at Level 2 (shown as a horizontal orange bar). Within the integration layer, the auditory and visual streams are “linked”, allowing for crossmodal sequences to be considered for SP.

Model 3: To model Hypothesis 3, which proposes audiovisual integration at both the local and global levels, this model incorporates integration layers at both Levels 1 and 2. The integration layer at Level 1 allows for crossmodal A-to-V and V-to-A transitions to be considered in TP.

Theoretical values of prediction and prediction-error signals

Next, we examined the steady-state values of that each signal would reach once TP and SP have been learned. Importantly, our analysis focused only on the signals generated by the last stimulus in each sequence, since all sequences in a block shared the same first two stimuli and crossmodal transitions occurred only at the last stimulus. We assessed the steady-state values of the prediction signals (P1_A, P2_A, P1_V, and P2_V) for the last stimulus with the principle outlined by Chao et al.¹⁸: The optimal value for each prediction signal is determined by minimizing the mean squared prediction error within the corresponding neural population. Utilizing this principle, once determining TP and SP, we can derive the values for all relevant prediction and prediction-error signals for each sequence (see further details in Methods).

To further elucidate model values in three models, we use sequences AAV and AAA in Block 1 as illustrative examples (Fig. 3c) (comprehensive model values for all blocks can be found in Supplementary Fig. 3). Here, we highlight several key points:

The values of the prediction signals are identical for sequences AAV and AAA. This is because the prediction of the upcoming last stimulus does not depend on the identity of the last stimulus.
In Models 1 and 2, local audiovisual integration is absent, and TP_A→V = 0. Therefore, the last stimulus V in AAV is unpredictable based on stimulus transitions, and local prediction P1_V = 0.
In Model 1, global audiovisual integration is absent, and SP_AAV = 0. Therefore, the last stimulus V in AAV is unpredictable based on sequence structure, leading to global prediction P2_V = 0.
In the integration layer where two streams are linked, different prediction signals are sent down in the auditory and visual streams to minimize their respective prediction errors. For example, in Model 2, the values of P2_A and P2_V differ since they aim to minimize different prediction errors (i.e., PE2_A and PE2_V, respectively).

To extract the prediction error, we conducted within-block contrasts: contrasting the model values between sequences AAV and AAA (AAV – AAA) in Blocks 1 and 2, and the model values between sequences VVA and VVV (VVA – VVV) in Blocks 3 and 4. These contrasts eliminated the prediction signals, which are identical for both sequences within a block, and contained only the contrast values of sensory and prediction-error signals. An illustrative example of these contrast values in Block 1 is shown in Fig. 3d.

For the integration layers, we propose that the neuronal populations from both streams would spatially cluster in the brain to facilitate effective information integration. This clustering would resemble the specialized subareas for processing specific modalities in the multisensory superior temporal sulcus^26,27. In the model, neuronal populations are grouped across both auditory and visual streams to represent this clustering (Fig. 3d). This is especially crucial in Model 3. In Model 3, PE2_A and PE2_V are collectively transmitted from grouped neuronal populations at Level 1 to grouped neuronal populations at Level 2. Consequently, we combined them into a single composite signal, PE2_A/V ( = PE2_A + PE2_V), based on the assumption that they cannot be spatially separated due to the limited resolution of EEG.

The comprehensive contrast values for all blocks are shown in Fig. 3e. Additionally, we constructed alternative models featuring modified elements related to integrated neuronal populations and signals (see details in Supplementary Fig. 4). These alternative models were subsequently evaluated. Moreover, we calculated the TP under the assumption that the last stimulus (e.g., last A in AAA) will continue to predict the next stimulus item (transition from the last A to the omission). We also tested models with different TP calculations: (1) excluding A-to-O transitions in AAA and V-to-O transitions and (2) considering the overall occurrences of a single stimulus rather than the transition between two stimuli.

Model evaluation with EEG data

Next, we tested the three hypotheses by quantifying how well the modeled contrast values (as in Fig. 3e) fit the EEG data. To capture prediction-error signals in accordance with the model shown in Fig. 3d, we contrasted the averaged ERSPs from different trial types: AAV – AAA in Blocks 1 and 2, and VVA – VVV in Blocks 3 and 4. Significant contrast responses were identified as those significantly different from zero using bootstrap analysis (n = 30 participants, 1000 resamples, alpha = 0.05, two-tailed, with false discovery rate correction), and non-significant responses were replaced with zeros. An example of a significant contrast response is shown in Fig. 4a, b shows the overall significant contrast responses, averaged across all channels for each block. These responses were mainly observed after the last stimulus, exhibiting varying patterns across blocks (also see Supplementary Result 1 for further details). We then pooled all the significant contrast responses into a tensor with three dimensions: Block (4 blocks), Channel (60 channels), and Time-Frequency (450 time points × 100 frequency bins).

**Fig. 4: Significant contrast responses.**

To fit the 3D tensor with the models, we used a decomposition approach to break down the tensor into various components of sensory and prediction-error signals, based on their corresponding contrast values in the models. For example, in the case of Model 1, we restricted the tensor to break down into 6 designated components: S_A, S_V, PE1_A, PE1_V, PE2_A, and PE2_V. Each of these components was activated across all four blocks, precisely matching the activation profiles shown in Fig. 3e. This was achieved by fixing the Block dimension to the modeled contrast values in the parallel factor analysis (PARAFAC), a high-dimensional decomposition method^28,29. We performed PARAFAC 50 times, each with a random initialization.

Following this, we assessed which model offered a better explanation of the data. First, we used the core consistency diagnostic (CORCONDIA) to quantify the interaction between components²⁹. A higher CORCONDIA indicates less interaction between the components, indicating a more appropriate fit for the decomposition^28,30. In Fig. 5a, CORCONDIA was found to be 0 ± 0.1% in Model 1, 14.8 ± 8.8% in Model 2, and 83.3 ± 0.6% in Model 3 (mean ± standard deviation, n = 50 random initializations). Model 3 showed a significantly higher CORCONDIA value compared to Model 1 (n = 50 random initializations for each, two-sample t test, alpha = 0.05, two-tailed, Bonferroni correction) (t[98] = 1005.1, p < 1e–16) and Model 2 (t[98] = 35.29, p < 0.001). Additionally, we assessed the model’s goodness of fit using the Bayesian Information Criterion (BIC). A lower BIC value suggests a better balance between the model’s fit and its complexity. Model 3 showed the lowest BIC value compared to Model 1 (t[98] = −8.66, p < 1e–14) and Model 2 (t[98] = −9.38, p < 1e–15). Furthermore, Model 3 also outperformed the other alternative models outlined in Supplementary Fig. 4 (see Supplementary Table 1) and models with different TP (Supplementary Table 2).

**Fig. 5: Model comparison and reconstruction.**

These results consistently indicated that Model 3 was the best-fitting model. To further substantiate this, we reconstructed the significant contrast responses using the 5 components derived from Model 3 (see details in Methods) and demonstrated that Model 3 was effective in capturing the diverse responses across different channels and blocks (Fig. 5b).

Hierarchical prediction-error signals extracted from Model 3

Figure 6a shows the 5 components in Model 3 (S_A, S_V, PE1_A, PE1_V, and PE2_A/V) extracted from PARAFAC. Each component extracted from PARAFAC is represented by the score or loading matrix (activation values) along the original dimensions: Block, Channel, and Time-Frequency. The first dimension was anchored to the four modeled contrast values. The remaining two dimensions captured how the component was activated across 60 channels (Channel), and across 450 time points and 100 frequency bins (Time-Frequency), describing its spatio-spectro-temporal signature.

**Fig. 6: Neural components extracted from the best-fitting model.**

S_A, the sensory signal in the auditory stream, was found at the bilateral temporal and central electrodes with increases in the theta (4–8 Hz), alpha (8–13 Hz), beta (13–30 Hz), and gamma (30–100 Hz) bands after the last stimulus. S_V, the sensory signal in the visual stream, was found at the contralateral parieto-occipital electrodes with increases in the theta and gamma bands after the last stimulus and then decreases in the alpha, beta, and gamma bands.

PE1_A, the first-level prediction-error signal in the auditory stream, was found at the bilateral temporal and central electrodes with increases in the theta, alpha, beta, and gamma bands and then a decrease in the gamma band. PE1_V, the first-level prediction-error signal in the visual stream was found at the bilaterally parieto-occipital electrodes with increases in the theta band and gamma bands and then decreases in the alpha and beta bands. PE2_A/V, the second-level prediction-error signal in the integrated stream, was found at the central-parietal electrodes with an increase in the gamma band and then decreases in the theta, alpha, and gamma bands, and a long-lasting decrease in beta band oscillations.

Spatially, S_V was activated more in the contralateral area due to the lateral stimulation delivery. On the other hand, S_A was activated in both hemispheres, which could result from less focal sound delivery by the speakers compared to headphones. For the prediction errors, PE1_V appeared in the bilateral scalp distribution and PE1_A appeared at the central electrodes, suggesting that the sensory signals located at the electrodes around the contralateral sensory areas now propagated to the central electrodes and both contralateral and ipsilateral electrodes. Spectrally, we measured the peak value for each component in each frequency band the 0–0.9 s time window (Fig. 6b). All five components exhibited higher values in the gamma band. Temporally, we focused on the dynamics of gamma-band signals by averaging the values across frequency bins within the gamma band for each component (Fig. 6c). The peak latencies for S_A, S_V, PE1_A, PE1_V, and PE2_A/V were 136 msec, 192 msec, 332 msec, 324 msec, and 328 msec, respectively. These latencies suggested a sequential bottom-up signal propagation from sensory input to prediction error, in line with signal flows in our model. Also, we could also observe the first-level prediction-error components inherit spatial and spectral characteristics of the sensory components, e.g., S_A and PE1_A both contained positive power responses in all frequency bands after the last stimulus onset in the bilateral temporal and central electrodes. Peak latencies for other frequency bands are provided in Supplementary Fig. 5.

Crossmodal integration during learning

The PARAFAC topographic results suggested that the unimodal prediction-error signals, PE1_A and PE1_V, were redirected to the central-parietal electrodes to generate the crossmodal prediction-error signal PE2_A/V. To explore how this integration connectome was established during learning, we assessed the functional connectivity between the electrodes involved in prediction-error signaling across different stages within a block. For each component (PE1_A, PE1_V, or PE2_A/V), we selected the five channels exhibiting the highest values based on the PARAFAC results in the Channel dimension. This resulted in 25 connections between PE1_A and PE2_A/V. Meanwhile, there were only 24 connections between PE1_V and PE2_A/V, since there was one channel shared between them (Fig. 7a). To assess learning, we divided the trials within a single block into 11 continuous time windows. Trials from each of these windows were then grouped across all four blocks.

**Fig. 7: Functional connectivity during learning.**

To measure spectral functional connectivity, we used pairwise phase consistency (PPC), a method measuring phase synchronization between two signals across trials while minimizing bias related to the number of trials³¹. For each connection, time window, and participant, we calculated PPC values and then averaged them across the relevant connections (either 25 or 24, depending on the components involved). Fig. 7b shows examples of these averaged PPC values for the connections between PE1_A and PE2_A/V across three time windows in one participant. Then, these averaged PPC values were further averaged across frequency bins within four distinct frequency bands (theta, alpha, beta, and gamma) and across time points (0–0.9 s after the last stimulus). For the connections between PE1_A and PE2_A/V, we compared the averaged PPC values across adjacent time windows for each frequency band to illustrate changes in functional connectivity throughout the learning stages (Fig. 7c). Similarly, for connections between PE1_V and PE2_A/V, the averaged PPC values were compared to depict changes of functional connectivity.

The pairwise comparisons revealed a significantly positive difference in the alpha band between PE1_A and PE2_A/V (n = 30 participants, 1000 resamples, alpha = 0.05, two-tailed, Bonferroni correction). Significantly positive differences were observed in the theta and alpha bands between PE1_V and PE2_A/V. These results showed that the alpha-band interactions between the central-parietal electrodes and the electrodes representing unimodal prediction-error signals gradually strengthened during learning, suggesting that the central-parietal electrodes gradually emerged as a hub for integrating crossmodal information.

Discussion

We combined a crossmodal experimental paradigm with a hypothesis-driven analysis to examine hierarchical crossmodal predictive coding for sequence processing. We demonstrated that hierarchical crossmodal predictions can be established by learning TP and SP in a higher-order brain region that integrates information from different sensory streams. Our finding provides a generalized framework for crossmodal predictive coding and reveals neural signatures of its representation and learning.

From the predictive coding perspective, convergence of sensory inputs along the hierarchy plays a critical role to minimize prediction errors. To demonstrate this, we calculated the theoretical values of the overall prediction error for each block in each model; that is the minimum of mean-squared prediction error, the principle in model value calculation. For each block (e.g., Block 1), the squared values of PE1_A, PE1_V, PE2_A, and PE2_V of each trial type (e.g., AAV or AAA) were weighted by the probability of its occurrence and then summed together. This summed value represents the overall prediction error, which was the same between Blocks 1 and 3, since the model values were the same in these two blocks with just the two sensory streams swapped. Similarly, the overall prediction error was the same between Blocks 2 and 4. The overall prediction errors in Model 3 were found to be the smallest among the three models (Supplementary Fig. 6). Convergence early at functional level of stimulus-to-stimulus transitions enables the brain to achieve the smallest sum of prediction errors, which is in line with the general framework of predictive coding theory.

The central-parietal electrodes, which showed minimal overlap with the extracted auditory and visual sensory signal areas (i.e., S_A and S_V), may facilitate the integration of crossmodal information, as suggested by our activity and connectivity analyses. The connectivity strength between the central-parietal electrodes and the electrodes of two unimodal prediction-error signals increased during learning, particularly in the alpha frequency. We also observed reduced alpha-beta power when comparing crossmodal and unimodal contexts, as well as during the processing of unimodal sequences within crossmodal contexts. This aligns with the decreased beta-band power reported in previous studies on multimodal tasks^32,33. Additionally, the PE2_A/V component exhibited a prolonged negative deflection in the beta band. The desynchronization of the alpha-beta oscillation also suggests a top-down transmission to update predictions for further error minimization^15,34,35. These findings imply that the central-parietal electrodes, as a hub, receives audiovisual information and updates crossmodal predictions, carried via alpha-beta oscillations. Furthermore, gamma oscillations are thought to facilitate bottom-up transmission of prediction errors from lower to higher neuronal layers^36,37,38. Additionally, the temporal sequence of peak gamma oscillations observed in S_A and S_V, followed by those at the central-parietal electrodes of PE2_A/V, suggests a directional flow of information from sensory to higher-order regions over time.

However, we also observed theta and alpha oscillations evoked around the onset of gamma oscillations in the components of both sensory inputs and unimodal prediction errors. In contrast, reduced theta and alpha oscillations were observed in the components of crossmodal prediction errors, indicating less coupling between the low and high frequency oscillations. Evidence of phase-amplitude cross-frequency coupling has been discovered in connections among different brain regions and large-scale networking between regions such as the hippocampus and prefrontal areas^39,40,41,42. Further advanced recording and methodologies such as Electrocorticography (ECoG) in the hippocampus area and whole cortical surface can be combined with cross-frequency coupling analysis to verify if lower- and higher-hierarchy crossmodal processing involves different ranges of distributed networks.

In addition, while central-parietal activation in EEG may reflect activity in the multisensory superior temporal areas in imaging, intracranial EEG, and MEG studies in the crossmodal context^43,44,45, further recording modalities should be considered to verify where crossmodal predictive coding occurs across the hierarchical structure. Moreover, studies have shown that convergence of sensory processing across modalities can occur already in subcortical and lower cortical areas. For example, inferior and superior colliculus have both been found to receive inputs of different sensory modalities^46,47. This could also be achieved by interconnections between sensory-specific primary cortices, which have been identified between the auditory and visual areas, and between auditory and somatosensory areas^23,24,27,48. We suggest that early convergence could occur when crossmodal transitions are processed at a short timescale, such as visual speech occurring a few hundred msec ahead and leading to the prediction of acoustical input of speech spectrograms and phonetic features⁴⁹. When considering complex sequences with multiple time scales, probability computations would be operated at a more distributed brain network. Our study used arbitrary stimuli to create two-level temporal regularities, while real-world stimuli can consist of more than two timescales of regularities, e.g. in music and language⁵⁰. It is needed to further reveal this central-parietal hub or a larger brain network processing complex sequences with natural stimuli and thus generalize crossmodal predictive coding in the multi-level hierarchy.

There is still room for further investigation of the hierarchical crossmodal predictive coding model, including probability encoding, neural representations of crossmodal prediction signals, trial-by-trial prediction updates, functional causality between hierarchical prediction-error signals, and interpretation of atypical audiovisual integration. First, the current model’s TP and SP are specifically designed for the local-global paradigm. There is no explicit mechanism explaining why and how these two probability types are encoded. A more systematic and biologically plausible approach would be to encode probabilities conditioned on histories of different timescales. For example, ranging from lower-level probabilities like p(A | A) (equivalent to TP_A→A), to higher-order probabilities like p(A | AA) (equivalent to SP_AAA), p(A | AAA), and so on. To further validate this approach, a paradigm with more diverse stimulus sequences is needed.

Second, crossmodal prediction signals were not estimated directly. We modeled the prediction-error signals when established crossmodal predictions were violated, and degrees of prediction errors also reflected crossmodal predictability. To extract neural representations of prediction signals in order to construct a comprehensive predictive coding network, we would need to break the interdependence between crossmodal predictions and crossmodal prediction errors by considering their timing delay. That is, a prediction is activated before input is received and then a prediction error is evoked after input is received.

Third, while we could provide some estimate of the learning effect by dividing EEG trials in one block into several segments, our model is limited by design to steady-state values, and trial-by-trial dynamics of crossmodal prediction updates over time remains unknown. Although Hierarchical Gaussian Filter is a way to measure the dynamics of how prediction errors are optimized at different levels^51,52, the hierarchical structure cannot be shaped when different timescales of statistical regularities are learned and affect hierarchical interactions. Still, our model with the hierarchical structure and analyses could be greatly improved with the explicit inclusion of learning over time.

Fourth, our study used PPC to investigate connections among the prediction-error components, without directional information. To address it, we could use directed connectivity measures, such as directed transfer function, partial directed coherence, and Granger causality, to estimate causal links from the lower-level prediction-error signals to the higher-level prediction-error signals. Moreover, our model approach is not restricted solely to ERSP and can be applied to not only other recording modalities such as the blood-oxygen-level-dependent (BOLD) signal, but also other analytical metrics, including directed functional connectivity. For example, we can combine this model approach and the contrast of directed functional connectivity to further investigate regional information flows among hierarchical prediction-error signals.

Finally, altered multisensory integration has been used to explain the difficulties of involving in audiovisual communication experienced by individuals with autism spectrum disorder^53,54. In contrast, multisensory integration can allow older people to compensate for declines in physical functions (e.g., reduced visual or auditory acuity) by inferring information from other sensory sources⁵⁵. Our task with a model approach can provide a platform to evaluate multisensory integration from a hierarchical view and offer insights into the neural mechanism about how and why deficits or gains in multisensory integration happen.

Methods

Participants

Thirty participants were recruited for this study (15 males and 15 females; age = 22.0 ± 2.4 years). The inclusion criteria were: (1) age between 20 and 50 years, (2) no difficulty in understanding the experimental procedure due to apparent deficits in cognition, hearing and vision, and (3) no medical history of neurological or psychological diagnosis. Signed consent was received from the participants before the experiment, and all protocols were approved by the research ethics committees of The University of Tokyo (No. 21–372) and National Taiwan University Hospital (No. 201906081RINA). All ethical regulations relevant to human research participants were followed.

Stimuli and task

The auditory stimulus (A) was created with three superimposed sinusoidal waves (500, 1000 and 1500 Hz; volume = 55 dB), and the visual stimulus (V) was a 16*16 green-black checkerboard (size: 4 degree visual angle with a 70 cm viewing distance) presented against a gray background (1920*1220 pixels). These two stimuli were arranged to create four sequences: AAA, AAV, VVV, and VVA with 200-msec duration and 300-msec SOA. A random interval of 1600–1900 msec (50 msec as a step) was inserted between the offset of one sequence and the onset of the next sequence.

Four blocks were respectively assigned a frequent sequence and an infrequent sequence. Block 1 consisted of 100 trials of AAV and 20 trials of AAA. Block 2 consisted of 20 trials of AAV and 100 trials of AAA. Block 3 consisted of 100 trials of VVA and 20 trials of VVV. Block 4 consisted of 20 trials of VVA and 100 trials of VVV. Each block was constructed as follows: The 120 trials were notionally divided into four phases of 30 trials each. Each phase was built from 25 frequent and 5 infrequent sequences in pseudorandom order, yielding an even distribution of trial types within the block.

For unimodal auditory local-global oddball paradigm, two auditory stimuli were created with superimposed sinusoidal waves: 500, 1000 and 1500 Hz for A tone, and 350, 700 and 1400 Hz for A’ tone. These two tones were arranged to create 2 sequences: AAA and AAA’. The setting of stimulation delivery was the same as above, but the sequence AAV in Blocks 1 and 2 was replaced with the sequence AAA’. In the xY Block, as the same stimulus sequence proportion in Block 1, there were 100 trials of AAA’ and 20 trials of AAA. In the xx Block (as Block 2), there were 20 trials of AAA’ and 100 trials of AAA.

Each block was presented twice, once for the left stimulation delivery, where the visual stimuli were delivered 5.28 degree visual angle away from the screen center and the auditory stimuli were delivered from the left speaker, and once for the right stimulation delivery. This led to the total eight block representations (runs) for the main task, and four runs for the unimodal task. We first demonstrated the four runs in a random order, to avoid the carry-over effect of visual stimuli from the main task, and then, the eight runs were delivered randomly. There was a break between runs. A white fixation dot was presented at a visual angle of 3.85 degrees below the screen center. The experimental protocol was programmed with the MATLAB-based PsychtoolBox⁵⁶.

Probability calculation

The values of the transition probability (TP) and the sequence probability (SP) differ across Blocks and Hypotheses. Here, we first demonstrate the calculations of TP and SP for Block 1 in Hypothesis 1. For a single sequence AAA, there are two A-to-A transitions, and one A-to-O transition. For a single sequence AAV, there are one A-to-A transitions, and one A-to-O transition. Therefore, the total number of A-to-A transitions (denoted as TN_A→A) is 2*20 (20 trials of sequence AAA) + 1*100 (100 trials of sequence AAV) = 140. Similarly, the total number of A-to-O transitions (denoted as TN_A→O) is 1*20 (sequence endings for 20 trials of sequence AAA) + 1*100 (100 trials of sequence AAV) = 120. Since there is no A-to-V transition in Hypothesis 1, the total number of A-to-V transitions (denoted as TN_A→V) is zero.

TP_A→A, TP_A→V, and TP_A→O then can be calculated by dividing the number of corresponding transitions by the total number of transitions ( = 140 + 120 + 0 = 260), which results in 140/260, 0, and 120/260, respectively (as shown in Table 1). For SP, there are 20 trials of AAA, 0 trial of AAV (no global integration), and 100 trials of AA. Thus, SP_AAA, SP_AAV, SP_AA are 20/120, 0, 100/120, respectively. The comprehensive calculations are formulized below.

For Block 1 in Hypothesis 1:

$${{TN}}_{A\to A}=2* 20+1* 100$$

(1)

$${{TN}}_{A\to V}=0$$

(2)

$${{TN}}_{A\to O}=1* 20+1* 100$$

(3)

$${{TP}}_{A\to A}={{TN}}_{A\to A}/({{TN}}_{A\to A}+{{TN}}_{A\to V}+{{TN}}_{A\to O})$$

(4)

$${{TP}}_{A\to V}={{TN}}_{A\to V}/({{TN}}_{A\to A}+{{TN}}_{A\to V}+{{TN}}_{A\to O})$$

(5)

$${{TP}}_{A\to O}={{TN}}_{A\to O}/({{TN}}_{A\to A}+{{TN}}_{A\to V}+{{TN}}_{A\to O})$$

(6)

$${{SP}}_{{AAA}}=20/120$$

(7)

$${{SP}}_{{AAV}}=0$$

(8)

$${{SP}}_{{AA}}=100/120$$

(9)

In Hypothesis 2, TPs are calculated based on Equations [1 ~ 6] as in Hypothesis 1. For SP, since there is global integration for sequence AAV, SP_AAA, SP_AAV, SP_AA are 20/120, 100/120, and 0 respectively. In Hypothesis 3, SP are calculated as in Hypothesis 2, and TPs are calculated as in Hypothesis 1, but with different transition numbers. With local integration, TN_A→A is still 140, but TN_A→V is 1*100 (100 trials of sequence AAV) = 100, and TN_A→O is 1*20 (sequence endings for 20 trials of sequence AAA) = 20.

For Block 2, the only difference from Block 1 is the number of trials. In this block, there are 100 trials of AAA and 20 trials of AAV. The calculations of TP and SP can be conducted according to Equations [1 ~ 9], with the numbers 100 and 20 being swapped.

For Blocks 3 and 4, the only difference from Blocks 1 and 2 is the modality. The calculations of TP and SP can be conducted based on the same principle, but with the modality A and V being swapped. For example, TP_A→V is swapped to TP_V→A, and SP_AAV is swapped to SP_VVA.

Model value calculation

The model values were calculated from TP and SP based on Chao et al.¹⁸. To demonstrate this, we use Block 1 as an example (see Fig. 3c). For the A-to-A transition (with the probability of TP_A→A), S_A = 1, and thus PE1_A = | 1 – P1_A | (|.| represents the absolute value). For the other transitions (A-to-O or A-to-V, the probability is 1 – TP_A→A), S_A = 0, and thus PE1_A = | 0 – P1_A | . Therefore, the sum of the mean of squared prediction errors at the auditory stream at the first level, denoted as MSE1_A, is:

$${{MSE}1}_{A}={{{TP}}_{A\to A}* \left(1-{P1}_{A}\right)}^{2}+{\left(1-{{TP}}_{A\to A}\right)* \left({0-P1}_{A}\right)}^{2}$$

(10)

which is a function of P1_A and can be minimized by P1_A = TP_A→A. A similar calculation can be applied to Level 2, where PE2_A = | PE1_A – P2_A | . For sequence AAA (with the probability of SP_AAA), S_A = 1 and PE1_A = | 1 – P1_A | , and thus PE2_A = | | 1 – P1_A | – P2_A | . For the other sequences (AA or AAV, the probability is 1 – SP_AAA), S_A = 0 and PE1_A = | 0 – P1_A | , and thus PE2_A = | | 0 – P1_A | – P2_A | . Therefore, the corresponding mean-squared error, denoted as MSE2_A, is:

$${{MSE}2}_{A}= \, {{{SP}}_{{AAA}}* ({{{\rm{|}}}}1-{P1}_{A}{{{\rm{|}}}}-{P2}_{A})}^{2}\\ +{\left(1-{{SP}}_{{AAA}}\right)* ({{{\rm{|}}}}0-{P1}_{A}{{{\rm{|}}}}-{P2}_{A})}^{2}$$

(11)

which is a function of P2_A and can be minimized by P2_A = SP_AAA*( | 1 – P1_A | ) + (1– SP_AAA)*P1_A. For P1_V and P2_V, similar calculations can be applied with the replacement from TP_A→A to TP_A→V and from SP_AAA to SP_AAV. Furthermore, the same calculations can be applied to Block 2 but with different values of TP and SP. For Blocks 3 and 4, similar calculations are done by replacing TP_A→A with TP_V→A, TP_A→V with TP_V→V, SP_AAA with SP_VVA, and SP_AAV with SP_VVV. The comprehensive calculations are shown below. For different hypotheses, different TP and SP values are used.

For Blocks 1 and 2:

$${P1}_{A}={{TP}}_{A\to A}$$

(12)

$${P2}_{A}={{SP}}_{{AAA}}* \left({{{\rm{|}}}}1-{{TP}}_{A\to A}{{{\rm{|}}}}\right)+\left(1-{{SP}}_{{AAA}}\right)* {{TP}}_{A\to A}$$

(13)

$${P1}_{V}={{TP}}_{A\to V}$$

(14)

$${P2}_{V}={{SP}}_{{AAV}}* \left(\left|1-{{TP}}_{A\to V}\right|\right)+\left(1-{{SP}}_{{AAV}}\right)* {{TP}}_{A\to V}$$

(15)

For Blocks 3 and 4:

$${P1}_{A}={{TP}}_{V\to A}$$

(16)

$${P2}_{A}={{SP}}_{{VVA}}* \left({{{\rm{|}}}}1-{{TP}}_{V\to A}{{{\rm{|}}}}\right)+\left(1-{{SP}}_{{VVA}}\right)* {{TP}}_{V\to A}$$

(17)

$${P1}_{V}={{TP}}_{V\to V}$$

(18)

$${P2}_{V}={{SP}}_{{VVV}}* \left({{{\rm{|}}}}1-{{TP}}_{V\to V}{{{\rm{|}}}}\right)+\left(1-{{SP}}_{{VVV}}\right)* {{TP}}_{V\to V}$$

(19)

Once, the predictions are determined and the prediction errors are calculated as:

$${{PE}1}_{A}={{{{\rm{|}}}}S}_{A}-{P1}_{A}{{{\rm{|}}}}$$

(20)

$${{PE}2}_{A}={{{{\rm{|}}}}{PE}1}_{A}-{P2}_{A}{{{\rm{|}}}}$$

(21)

$${{PE}1}_{V}={{{{\rm{|}}}}S}_{V}-{P1}_{V}{{{\rm{|}}}}$$

(22)

$${{PE}2}_{V}={{{{\rm{|}}}}{PE}1}_{V}-{P2}_{V}{{{\rm{|}}}}$$

(23)

EEG acquisition and preprocessing

EEG was recorded by a 64-channel 10–20 system Quick Cap and Neuroscan SynAmps RT amplifier (Compumedics Neuroscan USA, Ltd. Charlotte, NC, USA). The impedances were kept below 10 kΩ for the two eye electrodes and below 5 kΩ for the rest of the electrodes. The signals were collected with an online reference electrode near Cz, a 500 Hz sampling rate, and an online band-pass filter between 0.01 and 100 Hz. Event codes were sent at the onset of the first stimulus in a sequence. We used EEGLAB (Swartz Center for Computational Neuroscience)⁵⁷ for data preprocessing including extracting epochs from −1 to +1.9 s relative to the onset of the first stimuli (function: pop_epoch.m), and manually removing bad epochs when there was an obvious movement artifact or 80% of amplitudes were over ±100 microvolts. Then, components containing muscular and eye artifacts were removed using independent component analysis (ICA) (function: pop_ica.m) with ADJUST toolbox⁵⁸ (function: interface_ADJ.m). Current source density (CSD) analysis was conducted using CSD Toolbox to reduce the impact of volume conduction (function: CSD.m)⁵⁹. For CSD calculation, the montage of spherical coordinates was created using the channel information of the Quick Cap without channels CB1 and CB2 (function: Get_GH.m). The smoothing constant lambda was set as 1e–5, and the head radius was set as 10 cm.

Event-related spectral perturbation (ERSP)

We used Fieldtrip Toolbox⁶⁰ to quantify the time-frequency representation (from 1 to 100 Hz) of the preprocessed EEG for each trial, channel, and participant. The Morlet wavelet transformation was used with the wavelet cycles of 7 (function: ft_freqanalysis.m). To measure the ERSP for each trial, channel, and participant, we performed the baseline normalization by calculating the decibel with the baseline period from –0.2 to 0 s (time zero as the onset of the first stimulus) (function: ft_freqbaseline.m).

To make the left hemisphere the contralateral side for both left-side and right-side stimulation delivery, for trials where stimulation was applied to the left side, the ERSPs’ channel assignment was mirrored from left to right, and for trials where stimulation was applied to the right side, the channel layout was unchanged. Then, the ERSPs were then averaged across trials for each trial type in each block, and for each participant and each channel.

Significant contrast responses

Two types of contrasts were performed. First, to validate audiovisual integration (Fig. 2), we contrasted the averaged ERSP of AAA between Blocks 1 and 2 and the averaged ERSP of VVV between Blocks 3 and 4, for each participant and each channel. The significance across participants for each contrast and each channel was estimated using bootstrap analysis (function: bootci.m). For the topography, for each contrast, channel, and participant, the contrast responses were averaged across time points between 0 and 0.9 s after the last stimulus onset and across bins within the four frequency bands. The significance across participants was estimated using permutation analysis with the cluster correction set as at least two channels with close distance (function: ft_timelockstatisitcs.m). Second, to extract prediction-error signals (Fig. 4) for model-fitting, we contrasted the averaged ERSP of AAV to AAA in Blocks 1 and 2 and the averaged ERSP of VVA to VVV in Blocks 3 and 4, for each participant and each channel. For each contrast, channel, the significance across participants was estimated using bootstrap analysis.

Also, the signal-to-noise ratios in ERSPs may vary between the frequent and rare sequences due to the differing number of trials for each sequence type. To evaluate this potential bias, we conducted the same analysis with equalized trial numbers for the frequent and rare sequences. Specifically, we randomly selected trials from the frequent sequence to match the number of the rare sequence. For instance, in Block 1, which consisted of a total of 40 AAA trials, 40 AAV trials were randomly selected from a pool of 200 trials. Finally, the ERSPs were then averaged across the same trials for each trial type in each block, participant, and channel. The new contrast responses (Supplementary Fig. 7) and the extracted components (Supplementary Fig. 8) exhibited similar patterns to the results obtained from the original method (Supplementary Fig. 1 and Fig. 6).

PARAFAC analysis

PARAFAC analysis is a generalization of the principal component analysis to higher-order arrays. This method decomposes a tensor into a sum of outer products of vectors, which helps in identifying underlying patterns or structures in the multi-dimensional data. We performed the analysis using the N-way toolbox⁶¹ (function: parafac.m). The input was the significant contrast responses (Block × Channel × Time-Frequency). The option of “Fixmode” was applied to the Block dimension with the modeled contrast values while no constraint was applied on dimensions of Channel and Time-Frequency. The convergence criterion was set to be 1e − 6. We repeated the decomposition 50 times, each time with a set of random orthogonalized values for initialization. Each component decomposed from PARAFAC can be described by score or loading matrices (activation values) in the original dimensions: 4 model contrast values in the Block dimension, 60 activation values in the Channel dimension, and 450*100 activation values in the Time-Frequency dimension. The basic concept of PARAFAC is illustrated in Supplementary Fig. 9. Furthermore, we utilized two surrogate datasets to illustrate our model-fitting approach using PARAFAC, as depicted in Supplementary Fig. 10.

To evaluate the decomposition a CORCONDIA of 80~90% is considered a good consistency, and values below 50% indicate a problematic model with high interaction between components^28,30. On the other hand, BIC was calculated to correct the residual sum of squares (RSS) with the complexity of the model:

$${BIC}=u* \log \left(\frac{{RSS}}{u}\right)+w* \log \left(u\right)$$

(24)

Where u represents the size of inputs (e.g., 4 blocks × 60 channels × 450*100 time-frequency) and w represents the size of estimated outputs. See the values of w for the three models in Supplementary Table 1.

From 50 decompositions (each with random initialization), we acquired 50 CORCONDIA values and 50 BIC values. These values were compared between models using two-sample t test, and alpha was set as 0.05 with Bonferroni correction (function: ttest2.m).

Response reconstruction

We reconstructed the responses from 5 extracted components (i.e., S_A, S_V, PE1_A, PE1_V, and PE2_A/V) with dimensions of Block, Channel, and Time-Frequency. The response for Block i and Channel j was reconstructed (denoted as Recon) as follow:

$${Recon}\left(i,j\right)=\mathop {\sum }\limits_{c=1}^{5}({{Block}}_{c}\left(i\right)* {{{Channel}}_{c}\left(j\right)* {TimeFrequency}}_{c})$$

(25)

Where c represents the component index, where c = 1 (S_A), 2 (S_V), 3 (PE1_A), 4 (PE1_V), and 5 (PE2_A/V). For component c, Block_c represents the 4 model contrast values, Channel_c represents the 60 activation values in the Channel dimension, and TimeFrequency_c represents the 450*100 activation values in the Time-Frequency dimension.

Connectivity analysis

For the spectral functional connectivity, we performed PPC with Fieldtrip Toolbox (function: ft_connectivityanalysis.m). We paired 25 connections between PE1_A and PE2_A/V, and 24 connections between PE1_V and PE2_A/V because of one shared channel. We then segmented EEG responses in each block into 11 learning stages using a moving average approach. N is the number of trials in a phase of a block after bad epochs were excluded, and we created three segments (N/3 trials) for one phase and thus twelve segments in total for one block. Then, we used a window width of two segments and a step size of one segment to have 11 learning stages. For example, in a clean block (no bad trials, N = 30 from 120/4), and a segment has 10 trials (from 30/3). The first stage would include trials from 1^st to 20^th (first and second segments), and the second stage would include trials from 11^th to 30^th (second and third segments). For each stage, the trials from the four blocks were then pooled together.

PPC was evaluated at 450 time points and 100 frequency bins (1~100 Hz). These PPC values were then averaged across frequency bins for 4 frequency bands (theta: 4–8 Hz, alpha: 8–13 Hz, beta: 13–30 Hz, and gamma: 30–100 Hz), across time points (0–0.9 s relative to the last stimulus), and across connections (25 or 24 between two components). After average, there were 30 averaged PPC values for each time window, each pair of connected components, and each frequency band. We compared the value between each two windows within each participant for each pair using bootstrap analysis.

Statistics and reproducibility

The sample size in the current study (N = 30) was determined based on previous EEG/MEG studies with similar designs (N = 29 in Bekinschtein et al.¹³; N = 10 in Wacongne et al.¹⁷). For EEG analysis, we followed the general guideline to subjectively remove data containing bad artifacts, and an average of ~2.2% of the total 1920 trials for each participant was excluded (mean: 43.5 ± 38.3 trials). For statistical comparisons and data fitting analysis, the details including dimensional information of EEG datasets, the number of resampling, the correction for statistical threshold, key parameters, and toolbox functions were provided.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

The raw EEG data, processed data, and source data underlying main and Supplementary Figs. have been deposited in Open Science Framework (OSF)⁶².

Code availability

The code to calculate model values of predictions and prediction errors based on the three hypotheses has been deposited in OSF⁶².

References

Clark, A. Whatever next? Predictive brains, situated agents, and the future of cognitive science. Behav. Brain Sci. 36, 181–204 (2013).
Article PubMed Google Scholar
Friston, K., Thornton, C. & Clark, A. Free-Energy Minimization and the Dark-Room Problem. Front. Psychol. 3, 130 (2012).
Article PubMed PubMed Central Google Scholar
den Ouden, H. E. M., Kok, P. & de Lange, F. P. How Prediction Errors Shape Perception, Attention, and Motivation. Front. Psychol. 3, 548 (2012).
Google Scholar
Friston, K. Does predictive coding have a future? Nat. Neurosci. 21, 1019–1021 (2018).
Article CAS PubMed Google Scholar
Knill, D. C. & Pouget, A. The Bayesian brain: the role of uncertainty in neural coding and computation. Trends Neurosci. 27, 712–719 (2004).
Article CAS PubMed Google Scholar
Näätänen, R. The Mismatch Negativity: A Powerful Tool for Cognitive Neuroscience. Ear Hear. 16, 6–18 (1995).
Article PubMed Google Scholar
Näätänen, R., Paavilainen, P., Rinne, T. & Alho, K. The mismatch negativity (MMN) in basic research of central auditory processing: A review. Clin. Neurophysiol. 118, 2544–2590 (2007).
Article PubMed Google Scholar
Wacongne, C., Changeux, J. P. & Dehaene, S. A neuronal model of predictive coding accounting for the mismatch negativity. J. Neurosci. 32, 3665–3678 (2012).
Article CAS PubMed PubMed Central Google Scholar
Heslenfeld, D. J. Visual Mismatch Negativity. In Detection of Change: Event-Related Potential and fMRI Findings (ed. Polich, J.) 41–59 (Springer US, 2003). https://doi.org/10.1007/978-1-4615-0294-4_3.
Stefanics, G., Astikainen, P. & Czigler, I. Visual mismatch negativity (vMMN): a prediction error signal in the visual modality. Front. Hum. Neurosci. 8, 1074 (2015).
Article PubMed PubMed Central Google Scholar
Shen, G., Smyk, N. J., Meltzoff, A. N. & Marshall, P. J. Using somatosensory mismatch responses as a window into somatotopic processing of tactile stimulation. Psychophysiology 55, e13030 (2018).
Article PubMed Google Scholar
Shinozaki, N., Yabe, H., Sutoh, T., Hiruma, T. & Kaneko, S. Somatosensory automatic responses to deviant stimuli. Brain Res. Cogn. Brain Res. 7, 165–171 (1998).
Article CAS PubMed Google Scholar
Bekinschtein, T. A. et al. Neural signature of the conscious processing of auditory regularities. Proc. Natl. Acad. Sci. 106, 1672–1677 (2009).
Article CAS PubMed PubMed Central Google Scholar
Blundon, E. G., Rumak, S. P. & Ward, L. M. Sequential search asymmetry: Behavioral and psychophysiological evidence from a dual oddball task. PLOS One 12, e0173237 (2017).
Article PubMed PubMed Central Google Scholar
Chao, Z. C., Takaura, K., Wang, L., Fujii, N. & Dehaene, S. Large-Scale Cortical Networks for Hierarchical Prediction and Prediction Error in the Primate Brain. Neuron 100, 1252–1266.e3 (2018).
Article CAS PubMed Google Scholar
El Karoui, I. et al. Event-Related Potential, Time-frequency, and Functional Connectivity Facets of Local and Global Auditory Novelty Processing: An Intracranial Study in Humans. Cereb. Cortex 25, 4203–4212 (2015).
Article PubMed Google Scholar
Wacongne, C. et al. Evidence for a hierarchy of predictions and prediction errors in human cortex. Proc. Natl. Acad. Sci. 108, 20754–20759 (2011).
Article CAS PubMed PubMed Central Google Scholar
Chao, Z. C., Huang, Y. T. & Wu, C. T. A quantitative model reveals a frequency ordering of prediction and prediction-error signals in the human brain. Commun. Biol. 5, 1–18 (2022).
Article Google Scholar
van Wassenhove, V., Grant, K. W. & Poeppel, D. Visual speech speeds up the neural processing of auditory speech. Proc. Natl. Acad. Sci. 102, 1181–1186 (2005).
Article PubMed PubMed Central Google Scholar
Vroomen, J. & Stekelenburg, J. J. Visual Anticipatory Information Modulates Multisensory Interactions of Artificial Audiovisual Stimuli. J. Cogn. Neurosci. 22, 1583–1596 (2010).
Article PubMed Google Scholar
Colin, C. et al. Mismatch negativity evoked by the McGurk–MacDonald effect: a phonetic representation within short-term memory. Clin. Neurophysiol. 113, 495–506 (2002).
Article CAS PubMed Google Scholar
Olasagasti, I., Bouton, S. & Giraud, A. L. Prediction across sensory modalities: A neurocomputational model of the McGurk effect. Cortex 68, 61–75 (2015).
Article PubMed Google Scholar
Driver, J. & Noesselt, T. Multisensory interplay reveals crossmodal influences on ‘sensory-specific’ brain regions, neural responses, and judgments. Neuron 57, 11–23 (2008).
Article CAS PubMed PubMed Central Google Scholar
Stein, B. E., Stanford, T. R. & Rowland, B. A. The Neural Basis of Multisensory Integration in the Midbrain: Its Organization and Maturation. Hear. Res. 258, 4–15 (2009).
Article PubMed PubMed Central Google Scholar
Mcgurk, H. & Macdonald, J. Hearing lips and seeing voices. Nature 264, 746–748 (1976).
Article CAS PubMed Google Scholar
Barraclough, N. E., Xiao, D., Baker, C. I., Oram, M. W. & Perrett, D. I. Integration of Visual and Auditory Information by Superior Temporal Sulcus Neurons Responsive to the Sight of Actions. J. Cogn. Neurosci. 17, 377–391 (2005).
Article PubMed Google Scholar
Schroeder, C. E. & Foxe, J. J. The timing and laminar profile of converging inputs to multisensory areas of the macaque neocortex. Brain Res. Cogn. Brain Res. 14, 187–198 (2002).
Article PubMed Google Scholar
Bro, R. & Kiers, H. A. A new efficient method for determining the number of components in PARAFAC models. J. Chemom. 17, 274–286 (2003).
Article CAS Google Scholar
Harshman, R. A. & Lundy, M. E. PARAFAC: Parallel factor analysis. Comput. Stat. Data Anal. 18, 39–72 (1994).
Article Google Scholar
Pouryazdian, S., Beheshti, S. & Krishnan, S. CANDECOMP/PARAFAC model order selection based on Reconstruction Error in the presence of Kronecker structured colored noise. Digit. Signal Process. 48, 12–26 (2016).
Article Google Scholar
Vinck, M., van Wingerden, M., Womelsdorf, T., Fries, P. & Pennartz, C. M. The pairwise phase consistency: A bias-free measure of rhythmic neuronal synchronization. NeuroImage 51, 112–122 (2010).
Article PubMed Google Scholar
Göschl, F., Friese, U., Daume, J., König, P. & Engel, A. K. Oscillatory signatures of crossmodal congruence effects: An EEG investigation employing a visuotactile pattern matching paradigm. NeuroImage 116, 177–186 (2015).
Article PubMed Google Scholar
Roa Romero, Y., Senkowski, D. & Keil, J. Early and late beta-band power reflect audiovisual perception in the McGurk illusion. J. Neurophysiol. 113, 2342–2350 (2015).
Article PubMed PubMed Central Google Scholar
Arnal, L. H. & Giraud, A. L. Cortical oscillations and sensory predictions. Trends Cogn. Sci. 16, 390–398 (2012).
Article PubMed Google Scholar
Jiang, Y. et al. Constructing the hierarchy of predictive auditory sequences in the marmoset brain. eLife 11, e74653 (2022).
Article CAS PubMed PubMed Central Google Scholar
Bastos, A. M. et al. Visual Areas Exert Feedforward and Feedback Influences through Distinct Frequency Channels. Neuron 85, 390–401 (2015).
Article CAS PubMed Google Scholar
Fries, P. A mechanism for cognitive dynamics: neuronal communication through neuronal coherence. Trends Cogn. Sci. 9, 474–480 (2005).
Article PubMed Google Scholar
Michalareas, G. et al. Alpha-Beta and Gamma Rhythms Subserve Feedback and Feedforward Influences among Human Visual Cortical Areas. Neuron 89, 384–397 (2016).
Article CAS PubMed PubMed Central Google Scholar
Bragin, A. et al. Gamma (40-100 Hz) oscillation in the hippocampus of the behaving rat. J. Neurosci. Off. J. Soc. Neurosci. 15, 47–60 (1995).
Article CAS Google Scholar
Canolty, R. T. et al. High gamma power is phase-locked to theta oscillations in human neocortex. Science 313, 1626–1628 (2006).
Article CAS PubMed PubMed Central Google Scholar
Sirota, A. et al. Entrainment of neocortical neurons and gamma oscillations by the hippocampal theta rhythm. Neuron 60, 683–697 (2008).
Article CAS PubMed PubMed Central Google Scholar
Cohen, M. X., Elger, C. E. & Fell, J. Oscillatory activity and phase-amplitude coupling in the human medial frontal cortex during decision making. J. Cogn. Neurosci. 21, 390–402 (2009).
Article PubMed Google Scholar
Besle, J., Bertrand, O. & Giard, M.-H. Electrophysiological (EEG, sEEG, MEG) evidence for multiple audiovisual interactions in the human auditory cortex. Hear. Res. 258, 143–151 (2009).
Article PubMed Google Scholar
Schelenz, P. D. et al. Multisensory integration of dynamic emotional faces and voices: method for simultaneous EEG-fMRI measurements. Front. Hum. Neurosci. 7, 729 (2013).
Article PubMed PubMed Central Google Scholar
Starke, J., Ball, F., Heinze, H. J. & Noesselt, T. The spatio-temporal profile of multisensory integration. Eur. J. Neurosci. 51, 1210–1223 (2020).
Article PubMed Google Scholar
Champoux, F. et al. A role for the inferior colliculus in multisensory speech integration. Neuroreport 17, 1607–1610 (2006).
Article PubMed Google Scholar
Meredith, M. A. & Stein, B. E. Interactions among converging sensory inputs in the superior colliculus. Science 221, 389–391 (1983).
Article CAS PubMed Google Scholar
Schroeder, C. E. et al. Somatosensory input to auditory association cortex in the macaque monkey. J. Neurophysiol. 85, 1322–1327 (2001).
Article CAS PubMed Google Scholar
O’Sullivan, A. E., Crosse, M. J., Liberto, G. M. D., de Cheveigné, A. & Lalor, E. C. Neurophysiological Indices of Audiovisual Speech Processing Reveal a Hierarchy of Multisensory Integration Effects. J. Neurosci. 41, 4991–5003 (2021).
Article PubMed PubMed Central Google Scholar
Dehaene, S., Meyniel, F., Wacongne, C., Wang, L. & Pallier, C. The Neural Representation of Sequences: From Transition Probabilities to Algebraic Patterns and Linguistic Trees. Neuron 88, 2–19 (2015).
Article CAS PubMed Google Scholar
Mathys, C. D. et al. Uncertainty in perception and the Hierarchical Gaussian Filter. Front. Hum. Neurosci. 8, 825 (2014).
Article PubMed PubMed Central Google Scholar
Iglesias, S. et al. Hierarchical Prediction Errors in Midbrain and Basal Forebrain during Sensory Learning. Neuron 80, 519–530 (2013).
Article CAS PubMed Google Scholar
Meyer, G. F. & Noppeney, U. Multisensory integration: from fundamental principles to translational research. Exp. Brain Res. 213, 163–166 (2011).
Article PubMed Google Scholar
Saalasti, S., Tiippana, K., Kätsyri, J. & Sams, M. The effect of visual spatial attention on audiovisual speech perception in adults with Asperger syndrome. Exp. Brain Res. 213, 283–290 (2011).
Article PubMed Google Scholar
Freiherr, J., Lundström, J. N., Habel, U. & Reetz, K. Multisensory integration mechanisms during aging. Front. Hum. Neurosci. 7, 863 (2013).
Article PubMed PubMed Central Google Scholar
Kleiner, M., Brainard, D. & Pelli, D. What’s new in psychtoolbox-3. Perception 36, 1–16 (2007).
Google Scholar
Delorme, A. & Makeig, S. EEGLAB: an open source toolbox for analysis of single-trial EEG dynamics including independent component analysis. J. Neurosci. Methods 134, 9–21 (2004).
Article PubMed Google Scholar
Mognon, A., Jovicich, J., Bruzzone, L. & Buiatti, M. ADJUST: An automatic EEG artifact detector based on the joint use of spatial and temporal features. Psychophysiology 48, 229–240 (2011).
Article PubMed Google Scholar
Kayser, J. & Tenke, C. E. Principal components analysis of Laplacian waveforms as a generic method for identifying ERP generator patterns: I. Evaluation with auditory oddball tasks. Clin. Neurophysiol. 117, 348–368 (2006).
Article PubMed Google Scholar
Oostenveld, R., Fries, P., Maris, E. & Schoffelen, J. M. FieldTrip: Open Source Software for Advanced Analysis of MEG, EEG, and Invasive Electrophysiological Data. Comput. Intell. Neurosci. 2011, 156869 (2010).
PubMed PubMed Central Google Scholar
Andersson, C. A. & Bro, R. The N-way toolbox for MATLAB. Chemom. Intell. Lab. Syst. 52, 1–4 (2000).
Article CAS Google Scholar
Huang, Y. T. et al. Crossmodal Hierarchical Predictive Coding for Audiovisual Sequences in the Human Brain. Open Science Framework. https://doi.org/10.17605/OSF.IO/4S2E3 (2024).

Download references

Acknowledgements

We thank Kai-Jie Liang and Shih-Yao Mao for helping with participant recruitment and experiment preparation, and Felix B. Kern for proofreading. We also thank Hui-Fen Mao and Shih-Yao Mao for their help in managing IRB, and Shih-Yao Mao. This work was supported by World Premier International Research Center Initiative (WPI), MEXT, Japan (to Z.C.C.), and the Ministry of Science and Technology of Taiwan (MOST 109-2410-H-002-106-MY3) (to C.W. and H.M.).

Author information

Authors and Affiliations

International Research Center for Neurointelligence (WPI-IRCN), UTIAS, The University of Tokyo, Tokyo, Japan
Yiyuan Teresa Huang, Chien-Te Wu, Shinsuke Koike & Zenas C. Chao
Department of Multidisciplinary Sciences, Graduate School of Arts and Sciences, The University of Tokyo, Tokyo, Japan
Yiyuan Teresa Huang & Shinsuke Koike
School of Occupational Therapy, College of Medicine, National Taiwan University, Taipei, Taiwan
Chien-Te Wu, Yi-Xin Miranda Fang & Chin-Kun Fu
University of Tokyo Institute for Diversity & Adaptation of Human Mind (UTIDAHM), Tokyo, Japan
Shinsuke Koike

Authors

Yiyuan Teresa Huang
View author publications
You can also search for this author in PubMed Google Scholar
Chien-Te Wu
View author publications
You can also search for this author in PubMed Google Scholar
Yi-Xin Miranda Fang
View author publications
You can also search for this author in PubMed Google Scholar
Chin-Kun Fu
View author publications
You can also search for this author in PubMed Google Scholar
Shinsuke Koike
View author publications
You can also search for this author in PubMed Google Scholar
Zenas C. Chao
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

Z.C.C. conceptualized the study. Y.T.H., C.W., and Z.C.C. designed the experimental protocol, and Y.T.H., Y.M.F., and C.F. collected the data (Y.M.F. and C.F. contributed equally). Y.T.H. and Z.C.C. proposed the theoretical models and performed the data analyses. Y.T.H wrote the manuscript, and Z.C.C., C.W., and S.K. helped with editing. All authors contributed to and have approved the final paper.

Corresponding author

Correspondence to Zenas C. Chao.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Communications Biology thanks Jordan Hamm and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling Editors: Julio Hechavarria and Joao Valente. [A peer review file is available].

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Peer Review File

Supplementary Information

Reporting summary

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

Reprints and permissions

About this article

Cite this article

Huang, Y.T., Wu, CT., Fang, YX.M. et al. Crossmodal hierarchical predictive coding for audiovisual sequences in the human brain. Commun Biol 7, 965 (2024). https://doi.org/10.1038/s42003-024-06677-6

Download citation

Received: 01 December 2023
Accepted: 02 August 2024
Published: 09 August 2024
DOI: https://doi.org/10.1038/s42003-024-06677-6
Springer Nature Limited

Crossmodal hierarchical predictive coding for audiovisual sequences in the human brain

Abstract

Similar content being viewed by others

A quantitative model reveals a frequency ordering of prediction and prediction-error signals in the human brain

Spatiotemporal brain hierarchies of auditory memory recognition and predictive coding

Perceptual learning shapes multisensory causal inference via two distinct mechanisms

Introduction

Results

A crossmodal audiovisual local-global oddball paradigm

Univariate analysis of neural responses for audiovisual integration

Three hypotheses of crossmodal transitions across hierarchy

Quantitative models for evaluating the hypotheses

Theoretical values of prediction and prediction-error signals

Model evaluation with EEG data

Hierarchical prediction-error signals extracted from Model 3

Crossmodal integration during learning

Discussion

Methods

Participants

Stimuli and task

Probability calculation

Model value calculation

EEG acquisition and preprocessing

Event-related spectral perturbation (ERSP)

Significant contrast responses

PARAFAC analysis

Response reconstruction

Connectivity analysis

Statistics and reproducibility

Reporting summary

Data availability

Code availability

References

Acknowledgements

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Competing interests

Peer review

Peer review information

Additional information

Supplementary information

Peer Review File

Supplementary Information

Reporting summary

Rights and permissions

About this article

Cite this article

Share this article

Search

Navigation