Introduction

Reading difficulty (RD) is a common neurodevelopmental disorder with a prevalence of 5–17% across language systems1,2,3. It is characterized by difficulties in accurate and fluent reading despite appropriate cognitive ability and instruction4. Previous studies have shown that challenges in acquiring proficient reading and spelling skills persist into adulthood5,6,7. Deficits in phonological processing have been well documented as a signature of RD and are reflected in speech processing in early life8. Abnormal brain activity during speech processing might serve as a marker of RD and facilitate early diagnosis, because brain responses to speech are automatic and can be recorded even in newborns. A recent longitudinal study9 has shown that RD can be successfully predicted from lower functional connectivity between the left primary auditory cortex and the left planum temporale at the age of 5. Several studies have documented the predictive relationship between early childhood speech perception and later literacy skills10,11,12,13,14.

Phonological deficits could be reflected in both speech perception and production in early life15,16; however, relatively few studies have examined speech production in RD, presumably because it is challenging to collect speech production data from infants and young children. An alternative is to examine foreign speech perception and production in older children and adults with RD and to test whether brain activity patterns during these tasks can be reliable markers of RD, which would also facilitate the diagnosis of RD.

A substantial amount of research has documented deficits in speech perception in individuals with RD, for example, a significantly lower discrimination score between minimum contrast syllables (e.g., /ba/-/da/) than age controls and reading controls17,18,19. Individuals with RD were found to have less sharply defined phoneme boundaries in the categorical perception of the synthetic syllable continuum20.

Neuroimaging studies have also provided evidence for speech perception deficits in individuals with RD. For example, reduced brain activation in the left prefrontal cortex was found in children with RD compared to control children during auditory rhyming judgment in English21, and reduced activation was also found in the superior temporal cortex during auditory speech processing22,23,24,25. An fMRI study with multivoxel pattern analysis showed less distinct activation patterns in the bilateral superior temporal regions for /bA/ and /dA/ in beginning readers with a high familial risk of RD26, suggesting a low quality of phonemic representations.

Evidence of speech perception deficits in RD also comes from EEG research. For example, smaller amplitude and longer latency of mismatch negativity/response (MMN/MMR) and/or late discriminative negativity (LDN) elicited by deviant stimuli have been observed over frontocentral sites in children and adults with RD27, and in 20-month-old infants at high risk of RD28, suggesting reduced phoneme discrimination. Gu and Bi29, in their review and meta-analysis, reported persistent speech perception deficits (reduced MMN amplitude) in individuals with RD in alphabetic languages, with an even larger effect size in adults than in children with RD. This suggests that speech perception deficits do not disappear with age in RD, which is consistent with the finding that adults with RD continue to show phonological deficits even though they may have improved word decoding accuracy30.

In Chinese RD, speech perception deficits have also been documented, such as reduced brain activation in the dorsal left IFG during an auditory rhyming judgment task31, and reduced MMN elicited by deviant stimuli at frontal sites32 in children with RD compared to control children. In a Mandarin Chinese tone categorical perception study, children without RD, but not children with RD, showed greater MMN peak amplitude over frontal sites for cross-category deviants than for within-category deviants33, suggesting reduced categorical perception in Chinese children with RD compared to children without RD. However, no studies have examined differences between adults and children in speech perception deficits in Chinese RD.

Speech perception and production are tightly connected because the same phonological representations are involved in the two processes34. The poorly specified phonological representation in individuals with RD makes it challenging to formulate a speech-motor plan during production35, resulting in a slower speaking rate and more pauses in individuals with RD36. In a longitudinal study, toddlers who were later identified as having RD at school age showed slower speaking rates, longer pauses and reduced production of syllables per speaking turn at ages 2 and 3 compared to not-at-risk peers35, suggesting early difficulties in articulation planning for multi-syllabic utterances in children with RD. As children with RD reach school age, their difficulties with multi-syllabic words persist. They continue to struggle with repeating multi-syllabic words and nonwords37, or producing the names of multi-syllabic items38, compared to both age-matched control children and reading-matched control children.

Remarkably, the deficits in multi-syllabic production extend into adulthood. Adults with RD demonstrate slower rates and more errors than controls when repeating multi-syllabic phrases rapidly39. Beyond deficits in multi-syllabic production, information about articulatory movements for specific phonemes has also been found to be less accessible in adults with RD40, who showed difficulty in matching phonemes with drawings of the articulators’ positions when making a specific sound. However, one study found no deficits in articulatory speed in adults with RD, despite deficits in phonemic awareness41, whereas another more recent study found a slower articulatory rate in adults with RD, regardless of their comorbidity with motor coordination disorder42.

In summary, previous studies have focused on speed, pauses, and errors in speech production and more sophisticated measures of speech quality such as the voice onset time (VOT) of consonants and the frequency formants of vowels have not been used. VOTs and vowel frequency formants are objective acoustic analyses that quantitatively describe the quality of speech sounds. VOT refers to the time between the stop burst and the onset of the voicing for a consonant. A vowel frequency formant is a concentration of acoustic energy around a particular frequency in the speech wave. According to Marchetti et al.42, each vowel has three formants. The first formant (F1) is inversely related to vowel height. The second formant (F2) is related to the degree of backness of a vowel. We distinguish one vowel from another by the differences in these formants. These acoustic analyses of speech sound would provide a more accurate and detailed measure of phonological deficits associated with RD. Furthermore, no studies have examined brain activity patterns during speech production in individuals with RD.

Deficits have also been found in individuals with RD during foreign speech perception and production. Ylinen et al.43 found weaker MMN amplitude over the right temporal sites in RD children than in children without RD for second-language words but not for native words, suggesting specific deficits in perceiving second-language speech. Soroli & Ramus44 found that French adults with RD showed deficient foreign lexical stress discrimination but normal foreign stress and plosive production. However, in Bouhon et al.’s45 study, French adults with RD showed difficulties in producing English vowel contrasts. Specifically, the duration difference between /i:/-/ɪ/ was smaller than in adults without RD. Taken together, there might be specific deficits in foreign speech perception and production that are not evident in native languages.

One important factor that needs to be taken into account in foreign speech perception and production is the distance between the foreign language and the native language. The phonological similarities between native and foreign languages have been reported to influence foreign speech production46,47. When foreign speech sounds contrast strongly with native speech sounds, they may present a greater challenge for perception and production. Therefore, in the current study, we chose Spanish, whose speech sounds contrast with those of Chinese, so that we would have a greater chance of identifying deficits in RD.

One research gap in the literature is whether and how phonological deficits in RD are reflected in foreign speech perception and production, and how they differ between children and adults with RD. In the current study, we examined foreign speech production quality and brain activity during foreign speech perception and production in Chinese children and adults with RD using fNIRS. fNIRS is especially suitable for studying speech production because of its relative tolerance of motion artifacts. For the measures of VOT and vowel formants, we expected children to be more similar to a native Spanish speaker than adults, and readers without RD to be more similar than readers with RD. For the fNIRS data, we expected individuals with RD to show different brain activation patterns compared to individuals without RD during foreign speech perception and production, especially in the fronto-tempo-parietal regions. Moreover, we expected different age effects in individuals with RD compared to those without RD. For example, in some regions, there may be greater differences between adults and children in readers without RD than in those with RD, and vice versa. Finally, using a machine learning approach, we examined whether brain activity patterns during speech perception and production could serve as reliable markers of RD. We expected activation patterns in the fronto-tempo-parietal regions to be strong predictors of RD classification.

Results

Behavioral assessments

We compared chronological age and Raven score between readers with RD and readers without RD separately for children and adults. No significant difference was found in chronological age between children with RD and children without RD (t(41) = −2.01, p = 0.072), nor between adults with RD and adults without RD (t(40) = 0.05, p = 0.958). There was no significant difference in Raven scores between children with RD and children without RD (t(38) = 1.81, p = 0.078), while adults with RD scored lower on Raven than adults without RD (t(27.48) = 4.91, p < 0.001). We conducted sub-group analyses with Raven-matched adults with RD (N = 14) and adults without RD (N = 12) for the behavioral tests and fNIRS data, and found results similar to those for the whole sample. Therefore, we report results from the whole sample. Results from the sub-group analyses are reported in the supplementary results and Supplementary Table 1. Furthermore, we added Raven as a covariate in all statistics. Table 1 presents the results of the behavioral tests, and more detailed reports on the behavioral tests are presented in the supplementary results and Supplementary Figs. 3 & 4.
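The fractional degrees of freedom in t(27.48) are what an unequal-variance (Welch) t-test would produce; that reading is an inference from the reported statistic, not something this section states. A minimal sketch of such a test follows. The function name `welch_t` and the scores are hypothetical and serve only as illustration.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Return (t, df) for Welch's unequal-variance t-test on two samples."""
    na, nb = len(a), len(b)
    va, vb = variance(a) / na, variance(b) / nb  # squared standard errors
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df

# Hypothetical Raven scores, for illustration only (not the study's data)
controls = [118, 121, 115, 119, 122, 117]
rd_group = [104, 112, 98, 120, 101, 109]
t, df = welch_t(controls, rd_group)
```

When the two groups differ in variance, the approximated df falls below the pooled value of n1 + n2 − 2, which is how a non-integer value such as 27.48 can arise.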

Table 1 Demographic information and results on behavioral tests

Phonetic analyses for frequency formants

Figure 1 shows normalized vowel charts for the 5 vowels in each group. The vowel chart of children without RD showed relatively clear boundaries and less overlap across the vowels. The other groups of participants showed a considerable degree of overlap across the vowels. A Group by Age ANCOVA with Raven as a covariate revealed a significant main effect of Age (F(1,84) = 12.10, pFDR = 0.003, ηp2 = 0.13) for the averaged distance to the native speaker across all 5 vowels. Children were more similar to the native speakers than adults. The main effect of Group (F(1,84) = 2.20, pFDR = 0.430, ηp2 = 0.03) and the Age by Group interaction were not significant (F(1,84) = 2.22, pFDR = 0.140, ηp2 = 0.03). In order to understand whether the main effect of age was driven by individuals without RD, individuals with RD, or both, we conducted a simple effect analysis. We found that children without RD had a greater similarity to the model speaker than adults without RD (t(25.25) = −3.66, p = 0.001), but there was no difference between RD children and RD adults (t(38) = −1.52, p = 0.14), suggesting that the main effect of age was mainly driven by readers without RD. Furthermore, simple effect analysis also revealed that children without RD had a greater similarity to the native speaker than children with RD in the vowels’ frequency formants across all five vowels (t(40) = −4.69, p < 0.001, Fig. 2a), and no significant difference was found between adults with RD and adults without RD (t(40) = −0.29, p = 0.77).
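One way to picture the "averaged distance to the native speaker" is as the mean Euclidean distance between a participant's and the model speaker's vowels in normalized F1-F2 space. The sketch below rests on assumptions: the per-speaker z-score normalization (Lobanov-style) is a guess, since this section does not name the normalization method, and all formant values (in Hz) and function names are hypothetical.

```python
import math
from statistics import mean, pstdev

def normalize(formants):
    """Z-score F1 and F2 within one speaker (Lobanov-style; an assumption here)."""
    f1 = [v[0] for v in formants.values()]
    f2 = [v[1] for v in formants.values()]
    m1, s1, m2, s2 = mean(f1), pstdev(f1), mean(f2), pstdev(f2)
    return {k: ((a - m1) / s1, (b - m2) / s2) for k, (a, b) in formants.items()}

def mean_distance(speaker, model):
    """Average Euclidean distance to the model speaker across all vowels."""
    s, m = normalize(speaker), normalize(model)
    return mean(math.dist(s[v], m[v]) for v in m)

# Hypothetical (F1, F2) values for the five Spanish vowels, illustration only
model = {"a": (750, 1300), "e": (450, 1900), "i": (300, 2300),
         "o": (480, 950), "u": (320, 800)}
participant = {"a": (700, 1400), "e": (500, 1800), "i": (350, 2200),
               "o": (520, 1000), "u": (380, 900)}
d = mean_distance(participant, model)  # smaller means closer to the model
```

Normalizing within each speaker removes overall differences in vocal tract size, so a child and an adult can be compared with the same model speaker on vowel-space shape rather than on raw frequencies.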

Fig. 1: Normalized vowel charts for children without RD (upper left), children with RD (upper right), adults without RD (lower left), and adults with RD (lower right).

The bigger dots in each graph are normalized vowel formant frequencies for the native model speaker while the small dots represent individual participants.

Fig. 2: Results of vowels’ frequency formants and consonants’ VOT in each group.

Vowels’ frequency formants are presented in (a) and consonants’ VOT results are presented in (b). No group differences were found for VOTs. CAC age-control children, CRD children with RD, AAC age-control adults, ARD adults with RD. Error bars depict SEM. *p < 0.05, **p < 0.01.

Phonetic analyses for VOTs

We analyzed the voice onset time for /b/ and /d/ in the 5 Spanish words (i.e., dificil, dado, brazo, bueno, bebe). We found that every group showed a significant difference from the model speaker for both /b/ and /d/ (Fig. 2b); however, neither the main effect of Age (F(1,82) = 1.10, pFDR = 0.30, ηp2 = 0.01 for /b/, F(1,65) = 3.24, pFDR = 0.12, ηp2 = 0.05 for /d/) nor the main effect of Group (F(1,82) = 0.03, pFDR = 0.97, ηp2 < 0.001 for /b/, F(1,65) = 0.002, pFDR = 0.97, ηp2 < 0.001 for /d/) reached statistical significance. Furthermore, the interaction effect between Age and Group was not significant either (F(1,82) = 4.23, pFDR = 0.093, ηp2 = 0.05 for /b/, F(1,65) = 3.65, pFDR = 0.093, ηp2 = 0.06 for /d/).

fNIRS general linear model (GLM) results for the speech perception task

We conducted a repeated-measure ANCOVA of group (RD, AC) by age (children, adults) by similarity to Chinese (high similarity, low similarity) by syllable consistency (identical, different) with Raven score as a covariate for each channel. Results showed no main effect of age, group, similarity, or syllable consistency. A significant three-way (age × group × similarity) interaction was found in CH31 (right IFG, F(1,80) = 10.07, pFDR = 0.048, ηp2 = 0.11) and CH37 (right DLPFC, F(1,80) = 16.16, pFDR = 0.006, ηp2 = 0.17). Simple effect analysis showed that adults with RD had a greater hemodynamic response than adults without RD in perception of Spanish syllables with high similarity to Chinese (F(1,80) = 6.92, p = 0.010 for CH31, F(1,80) = 6.92, p = 0.010 for CH37), while no significant differences were found between children without RD and children with RD. No group differences were found for Spanish syllables with low similarity to Chinese (Fig. 3a, b). Another way to explain the interaction is a greater decrease in activation from children without RD to adults without RD than that from children with RD to adults with RD for the perception of Spanish syllables with high similarity to Chinese but not for Spanish syllables with low similarity to Chinese (F(1,80) = 6.05, p = 0.016 for CH31, F(1,80) = 6.66, p = 0.012 for CH37).
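Because the same model is fitted for every channel, the reported pFDR values reflect a correction for multiple comparisons. The sketch below shows the Benjamini-Hochberg procedure as one common way to obtain FDR-adjusted p-values. That the authors used this specific procedure is an assumption, and the input p-values are hypothetical.

```python
def benjamini_hochberg(pvals):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical uncorrected p-values from four channels
p_fdr = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

Each raw p-value is scaled by the number of tests divided by its rank, so a channel has to stand out against all other channels to remain significant after correction.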

Fig. 3: Results from the GLM analysis of the fNIRS tasks.

a, b Significant three-way interaction for CH31 and CH37 during the speech perception task. CAC age-control children, CRD children with RD, AAC age-control adults, and ARD adults with RD. c Children with RD showed a significant negative correlation between pseudoword rhyming and activation in CH37 (right DLPFC) during perception of Spanish syllables with high similarity to Chinese. d Significant interaction between language and group for CH22 during the speech production task. *pFDR < 0.05, **pFDR < 0.01.

fNIRS general linear model (GLM) results for the speech production task

An ANCOVA of group (RD, AC) by age (children, adults) by language (Chinese, Spanish) was conducted with Raven as a covariate for each channel’s data in the speech production task. We found no significant main effects of Age, Group, or Language, as well as no interactions between Age and Group. However, a significant Language by Group interaction was observed in channel 22 (left MTG, F(1,74) = 11.53, pFDR = 0.048, ηp2 = 0.14) (Fig. 3d). Simple effect analysis indicated that individuals without RD exhibited greater deactivation for Spanish than for Chinese (F(1,74) = 8.12, p = 0.006), whereas individuals with RD showed greater deactivation for Chinese than for Spanish (F(1,74) = 5.64, p = 0.020). The interaction could also be explained by the fact that individuals with RD had reduced deactivation compared to individuals without RD for Spanish production (F(1,74) = 7.10.25, p = 0.002), while no significant group difference was observed for Chinese production (F(1,74) = 0.17, p = 0.68).

Brain-behavioral correlations

In order to understand how abnormal brain activation is correlated with phonological deficits, we conducted Pearson’s correlations between the activation of channels exhibiting a significant interaction in each task (beta values of CH31 and CH37 for the high similarity to Chinese condition in the speech perception task and beta values of CH22 in the speech production task) and phonological awareness for children with RD and adults with RD separately. A negative correlation was found between the activation of CH37 during perception of Spanish syllables with high similarity to Chinese and pseudoword rhyming judgment (r = −0.629, pFDR = 0.084) for children with RD (Fig. 3c). No significant correlation was found between activation and phonological awareness in adults with RD. Steiger’s Z test revealed a significant difference between children with RD and adults with RD in the correlation between activation of CH37 during Spanish perception and pseudoword rhyming judgment (z = 2.90, p = 0.004).
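Comparing a correlation between two independent groups is often done with Fisher's r-to-z transformation. The sketch below illustrates that calculation. Whether it matches the exact test variant used here is an assumption, and the sample sizes and the adult correlation in the example call are hypothetical.

```python
import math

def compare_independent_correlations(r1, n1, r2, n2):
    """Z test for the difference between two independent Pearson correlations."""
    z1, z2 = math.atanh(r1), math.atanh(r2)  # Fisher r-to-z transform
    se = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    z = (z1 - z2) / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p from the standard normal
    return z, p

# r = -0.629 is the reported child correlation; n = 20 per group and
# r = 0.20 for adults are hypothetical placeholders
z, p = compare_independent_correlations(-0.629, 20, 0.20, 20)
```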

Classification performance

Using LOOCV, the SVM classifier yielded an accuracy ranging from 60% to 90% when classifying children, adults, or all participants across the two fNIRS tasks (Table 2). For classifications with higher accuracy than the permutation tests, we listed the p-value. The null hypothesis distribution obtained from permutation testing is displayed in Supplementary Fig. 5.
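The logic of leave-one-out cross-validation combined with a permutation test can be sketched as follows. To keep the example dependency-free, a nearest-centroid classifier stands in for the SVM; it is not the classifier used in the study. The feature values, labels, and number of permutations are also hypothetical.

```python
import random

def nearest_centroid_predict(train_X, train_y, x):
    """Stand-in classifier (NOT an SVM): pick the class with the closest mean."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_y)):
        rows = [r for r, y in zip(train_X, train_y) if y == label]
        centroid = [sum(col) / len(rows) for col in zip(*rows)]
        dist = sum((a - b) ** 2 for a, b in zip(x, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

def loocv_accuracy(X, y):
    """Hold out each participant once and predict them from all the others."""
    hits = 0
    for i in range(len(X)):
        train_X, train_y = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        hits += nearest_centroid_predict(train_X, train_y, X[i]) == y[i]
    return hits / len(X)

def permutation_p(X, y, n_perm=200, seed=0):
    """Share of label shuffles whose LOOCV accuracy reaches the observed one."""
    rng = random.Random(seed)
    observed = loocv_accuracy(X, y)
    count = 0
    for _ in range(n_perm):
        shuffled = y[:]
        rng.shuffle(shuffled)
        count += loocv_accuracy(X, shuffled) >= observed
    return observed, (count + 1) / (n_perm + 1)

# Hypothetical channel activations (rows = participants), 0 = control, 1 = RD
X = [[0.1, 0.2], [0.2, 0.1], [0.0, 0.3], [0.15, 0.25],
     [1.0, 1.1], [1.2, 0.9], [0.9, 1.0], [1.1, 1.2]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
accuracy, p_value = permutation_p(X, y)
```

Shuffling the labels breaks any real link between activation and group, so the shuffled accuracies approximate the null distribution against which the observed accuracy is judged.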

Table 2 Accuracies of SVM across languages and tasks in the classification model

Brain regions with high discriminative power

For the five classifications that showed a significantly higher accuracy than the permutation tests (Table 2), we calculated the frequency of channels appearing in the optimum feature set during cross-validation. As Fig. 4a shows, several regions exhibited relatively large weights (appearing in at least 80% of the optimum feature sets of cross-validation folds) for classifying RD in children using the Spanish production task, including the left MTG (CH22) and the right supramarginal gyrus (CH29).
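The 80% criterion amounts to counting how often each channel is selected across cross-validation folds. A minimal sketch follows, assuming the selected channels of each fold are available as lists. The fold contents and the function name are hypothetical, and the feature-selection step itself is not shown.

```python
from collections import Counter

def stable_channels(selected_per_fold, threshold=0.8):
    """Channels that appear in at least `threshold` of the folds' feature sets."""
    n_folds = len(selected_per_fold)
    counts = Counter(ch for fold in selected_per_fold for ch in set(fold))
    return sorted(ch for ch, c in counts.items() if c / n_folds >= threshold)

# Hypothetical optimum feature sets from five folds
folds = [["CH22", "CH29", "CH5"], ["CH22", "CH29"], ["CH22", "CH29", "CH7"],
         ["CH22", "CH5"], ["CH22", "CH29"]]
consistent = stable_channels(folds)
```

In this toy input, CH22 appears in all five folds and CH29 in four of five, so both meet the 80% threshold, while CH5 and CH7 do not.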

Fig. 4: Feature weight maps for RD classification.

a In children using the Spanish production task, b in adults using the Spanish perception task with low similarity to Chinese, c in adults using the Chinese production task, d in adults using the Chinese production task and the Spanish perception task with high similarity to Chinese, and e in children and adults using the Chinese production task. The Figure shows the weight assigned to each feature (fNIRS channel) in all folds of the LOOCV. The diameter of the sphere at each feature reflects its relative weight, with larger diameters indicating greater weights.

For classifying RD in adults using the perception of Spanish syllables with low similarity to Chinese, Fig. 4b shows that the left DLPFC (CH2, CH13), left premotor and SMA (CH8), left MTG (CH22), right postcentral (CH34) and right IFG (CH31) exhibited relatively large weights.

For classifying RD in adults using the Chinese production task, Fig. 4c shows that the bilateral IFG (CH9, CH31), bilateral premotor and SMA (CH11, CH47), left MTG (CH24), right primary somatosensory cortex (CH33), and right PMC (CH36, CH48) exhibited large weights.

In Fig. 4d, for classifying RD in adults using the Chinese production task and the perception of Spanish syllables with high similarity to Chinese, regions with large weights include the left IFG (CH9), left frontal eye fields (CH1), left PMC (CH10), right postcentral (CH27), left postcentral (CH11), left MTG (CH24), right DLPFC (CH37), and right premotor and SMA (CH46, CH47, CH48).

When classifying RD in both children and adults using the Chinese production task, regions that showed large weights include the bilateral premotor and SMA (CH11, CH16, CH39, CH47), bilateral postcentral (CH19, CH27, CH36), bilateral MTG (CH24, CH25, CH28), left IFG (CH14), left supramarginal gyrus (CH18), and right IPL (CH41) (Fig. 4e).

Taken together, channels located in the left MTG were the most consistent region with high weights, because they appeared in all five classifications, and the left premotor cortex, left SMA, and the left IFG had high weights in three classifications.

Discussion

In the current study, we compared brain activation during speech perception and production in children and adults with or without RD. We found reduced Spanish pronunciation accuracy in children with RD compared to children without RD in the vowel frequency formants analysis. In the brain, we found reduced differences between adults and children in RD readers compared to readers without RD in the perception of Spanish syllables in the right inferior frontal gyrus (IFG) and right DLPFC, suggesting slowed development in these regions in individuals with RD. We also found reduced language differentiation between Chinese and Spanish in the left MTG in individuals with RD compared to individuals without RD in the production task. Moreover, using a machine learning approach, we found that brain activity patterns in the left MTG, left premotor, SMA, and left IFG during the speech tasks were the most reliable features for classifying individuals with RD. Our findings provide evidence for brain abnormalities associated with phonological deficits during foreign speech processing in RD from a developmental perspective. Discussion on the findings of behavioral assessments is presented in the supplementary discussion.

In the phonetic analysis, we found that children had better Spanish vowel pronunciation than adults, which is consistent with previous findings that children in general have advantages over adults in foreign speech imitation48,49. According to Yeni-Komshian et al.50, 11-year-old children achieved a better score in second-language pronunciation than adults, even though their scores were not as high as those of children younger than 6. Our results are consistent with this finding, suggesting that 11-year-old children still have advantages over adults in foreign speech learning.

We also found poorer vowel pronunciation in children with RD than children without RD but adults with RD did not differ significantly from adults without RD. This is because children without RD had a better performance than adults without RD but such an advantage was not found in children with RD compared to adults with RD. Our finding suggests that children with RD have less accurate foreign speech production than control children, presumably due to their phonological deficits. We speculate that phonological deficits in children with RD affect their learning to produce foreign speech sounds so that they do not show an age advantage compared to adults on the speech production task.

In the brain, for the speech perception task, we found that adults showed decreased activation compared to children in readers without RD but not in readers with RD in the right IFG and right DLPFC for Spanish syllables with high similarity to Chinese. Less involvement of these regions in adults than in children is generally interpreted as less effort in adults than in children51, especially when these syllables are similar to the native language. The role of the right IFG in phoneme discrimination has been repeatedly documented. For example, a study by Myers et al.52 found that the right IFG exhibited greater activation in discriminating both between-category and within-category trials than identical trials, suggesting its involvement in phoneme discrimination. Furthermore, Kovelman, Yip, and Beck53 found that the right IFG showed greater activation to deviant stimuli than standard stimuli in native and non-native phoneme discrimination, while the left IFG was specifically sensitive to native phonemes.

The dorsolateral prefrontal cortex (DLPFC) is primarily associated with working memory and other executive functions54. The lower involvement of this region in adults than in children suggests less working memory effort55, especially when the Spanish syllables were similar to Chinese. Furthermore, we found a negative correlation between brain activation in the right DLPFC and pseudoword rhyming judgment in children with RD in the current study, further suggesting that children with RD who have better phonological awareness require less working memory effort in this region for speech perception. However, there was no difference between adults with RD and children with RD, suggesting reduced development of the speech network in readers with RD compared to typical readers, probably because of the influence of their phonological deficits.

For the speech production task, RD readers showed reduced deactivation compared to readers without RD in the anterior part of the left MTG (i.e. channel 22) during Spanish but not Chinese production. Another way to interpret the interaction is that readers without RD showed greater deactivation in this region in Spanish than in Chinese production, but RD readers showed greater deactivation in Chinese than in Spanish production. The MTG is believed to be part of the default mode network56,57 and greater deactivation for Spanish than Chinese in readers without RD suggests greater challenge in doing the Spanish production task than the Chinese production task. Nonetheless, RD readers could not efficiently deactivate the default mode network for the more challenging foreign speech imitation to a greater degree. The default mode network has been found to play an important role in learning58,59, and its abnormality has been reported in developmental disorders, such as RD60,61,62, ASD63,64,65, and ADHD66,67,68. In a recent study, the default mode network was found to show the largest developmental changes in brain signal complexity for participants at 6–13 years of age, compared to five other networks, namely, the vision, motor, dorsal attention, ventral attention, and frontal-parietal network69. Compared to the early-developing vision and motor networks, late-developing networks such as the DMN have a longer developmental window and therefore might be more influenced by learning experiences and environment. Therefore, the abnormality in the default mode network might be due to the atypical learning experiences in individuals with developmental disorders.

Using MVPA, individuals with RD were distinguished from age-matched non-RD counterparts with relatively high accuracy based on brain activation patterns during the speech tasks, suggesting a reliable classifier of RD. We found that the fNIRS channels located in the left MTG, left premotor cortex, left SMA, and left IFG consistently showed high discriminative power.

The left MTG includes an anterior channel (channel 22) and a posterior channel (channel 24) (Fig. 4). Channel 22 is part of the DMN, as discussed above, whereas channel 24 is involved in speech perception and phonological representation. In the model by Hickok and Poeppel70, the posterior MTG supports the sound-based representation of speech. It has been found that the posterior MTG exhibits greater activation for speech perception in noise than under normal conditions71, suggesting its importance in speech perception. Previous studies have also reported decreased activation of the left MTG in individuals with RD compared to individuals without RD across both speech and reading tasks72,73,74,75,76, suggesting deficient phonological representation and speech processing in individuals with RD. Consistent with previous studies, our results from a machine learning approach further suggest that the brain activation pattern in the left MTG during speech tasks is a reliable marker of RD.

The premotor cortex has been found to play an important role in speech-motor planning, phonological short-term memory, and sensorimotor integration77,78,79. The SMA has also been found to play a crucial role in planning complex motor sequences, which is essential for handling the “complex speech demands” in tasks with unfamiliar words and complex nonwords production80,81,82. The premotor cortex and SMA have been found to be involved in not only speech production but also speech perception83,84, because perception and action share a representation system85,86. Abnormalities in these two regions during speech tasks make them reliable features of RD. Consistent with our finding, previous studies have also shown reduced activations in the SMA and premotor regions in individuals with RD during phonological rhyming tasks and phonological short-term memory tasks87,88, suggesting that abnormal function of these regions might be associated with the phonological deficits in RD. Taken together, with a machine learning approach, we suggest a functional abnormality of these regions associated with phonological deficits during speech processing to be key features of Chinese RD.

The left IFG, where Broca’s area is housed, is also involved in speech production planning70,89. In addition to its role in speech production, the left IFG has been found to be involved in other phonological processes such as phonemic discrimination90, phonological working memory91,92, and phonological competition and selection93,94. Previous research has consistently found abnormal structure and function of the left IFG associated with RD, especially in Chinese RD95. Our study, using a machine learning method, also confirms that abnormal activation patterns of the left IFG during speech tasks can be a reliable marker of RD in Chinese.

The relatively small sample size and cross-sectional design might have limited our capability of revealing developmental changes in the RD readers and typical readers in foreign speech perception and production. Moreover, future research is also needed to examine whether these findings are replicable in other languages.

To conclude, in the current study, we revealed neurological differences that are associated with phonological deficits reflected in low quality of foreign speech perception and production in individuals with RD. Moreover, we found that brain activation patterns in the left MTG, left premotor, SMA, and left IFG can serve as reliable classifiers of RD regardless of age and speech tasks. Our findings provide important evidence for abnormal foreign speech processing in RD from a developmental perspective.

Methods

Participants

We recruited fifth-grade children from public elementary schools, and students from associate degree colleges in the local city. Participants with RD met the following criteria: (1) the standard score on Raven was above 80; (2) the z-score was below −1.5 on at least one of three reading tests, namely, a Chinese character naming test, a Chinese sentence reading fluency test, and a one-minute Chinese character naming test. The inclusion criteria for participants without RD were: (1) the standard score on Raven was above 80; (2) the z-score was above −1 on all three reading tests. We had 20 children with RD (mean age = 11.00 years, range 10–12, 12 males), 24 age-matched children without RD (mean age = 10.58 years, range 10–12, 9 males), 20 adults with RD (mean age = 19.63 years, range 18–22, 9 males), and 23 age-matched adults without RD (mean age = 19.65 years, range 18–24, 10 males). All participants were native Chinese speakers, right-handed, free of neurological or psychiatric diseases, and had not learned Spanish. All adults with RD and parents of children in the RD group reported a history of reading difficulties including poor reading accuracy and fluency. The IRB at Sun Yat-Sen University approved the study and consent procedures. All participants/parents of child participants signed written consent before we conducted any testing. Children also gave assent.
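The inclusion criteria above can be expressed as a small decision rule. The sketch below is illustrative only: the function name `classify_reader` and the return labels are hypothetical, and the z-scores are assumed to have been computed already against the relevant age norms.

```python
def classify_reader(raven_standard, reading_z_scores):
    """Return 'RD', 'control', or None (excluded) under the stated criteria."""
    if raven_standard <= 80:
        return None  # both groups require a Raven standard score above 80
    if any(z < -1.5 for z in reading_z_scores):
        return "RD"  # below -1.5 on at least one of the three reading tests
    if all(z > -1 for z in reading_z_scores):
        return "control"  # above -1 on all three reading tests
    return None  # e.g. a lowest z between -1.5 and -1 fits neither group

# Hypothetical z-scores: character naming, sentence fluency, one-minute naming
group = classify_reader(105, [-1.7, -0.4, -0.9])
```

The gap between the two cutoffs means that borderline readers, whose lowest score lies between −1.5 and −1, belong to neither group, which keeps the two groups clearly separated.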

Behavioral assessments

The Chinese character naming test is a measure of word decoding accuracy, in which the participant is asked to read aloud 150 Chinese characters without a time limit. The total number of characters read correctly is the raw score. The Chinese sentence reading fluency test is a measure of reading fluency and reading comprehension, in which the participant is asked to silently read 100 sentences of varying length and make a judgment whether each sentence makes sense in meaning, and the time limit is 3 min. The total number of characters in sentences that are correctly judged is the raw score. Norms for fifth-grade children on these two tests are available from a previous study96. We tested 215 adults without RD from the same colleges where we found adults with RD to develop a norm for adults on the character naming test (mean ± standard deviation: 140.02 ± 6.41 characters) and the sentence reading fluency test (mean ± standard deviation: 1379.88 ± 377.21 characters).
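
As a worked example of how a raw score relates to these norms, the sketch below standardizes an adult's raw score against the adult means and standard deviations reported above. The function name, the dictionary keys, and the example raw score of 130 are hypothetical.

```python
# Adult norms reported in the text: (mean, standard deviation), in characters.
ADULT_NORMS = {
    "character_naming": (140.02, 6.41),
    "sentence_fluency": (1379.88, 377.21),
}


def z_score(raw: float, test: str) -> float:
    """Standardize a raw score against the adult norm for the given test."""
    mean, sd = ADULT_NORMS[test]
    return (raw - mean) / sd


# Hypothetical adult who read 130 of the 150 characters correctly.
example_z = z_score(130, "character_naming")
print(round(example_z, 2))  # about -1.56, i.e. below the -1.5 cutoff
```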

The one-minute Chinese character naming test is a character reading fluency test that is composed of two parts: 150 regular characters and 150 irregular characters. Regular characters are those that share the same pronunciation with the phonetic radical, while irregular characters are those that have a different pronunciation from the phonetic radical. The test requires participants to read the characters as quickly and accurately as possible within one minute. The one-minute character naming test was administered to 201 college students (100.45 ± 17.70 for regular characters, 80.92 ± 19.10 for irregular characters) and 217 fifth-grade children (71.18 ± 18.11 for regular characters, 48.72 ± 18.29 for irregular characters) to develop norms for adults and children, respectively.

In addition to the three reading tests used for screening, all participants also completed meta-linguistic awareness tests for phonological awareness, morphological awareness, and orthographic awareness, as well as cognitive ability tests for working memory and rapid automatized naming (RAN). Phonological awareness was tested with English words and pseudowords, while morphological awareness and orthographic awareness were tested with Chinese materials. All of the participants were Chinese-English bilinguals. English rather than Chinese materials were used in the phonological awareness tests because all Chinese characters are monosyllabic, so Chinese materials might produce a ceiling effect in adults. Using English pseudowords also helped to equalize material familiarity among participants.

Phonological awareness was measured using a 30-item initial sound deletion test and a 40-item pseudoword rhyming test. In the initial sound deletion test, participants were orally presented with a word and asked to delete the first consonant sound and then pronounce the rest of the word (e.g., the word “sock” /sɑk/ should be pronounced as “ock” [ɑk] after the initial sound is deleted). In the pseudoword rhyming test, participants were orally presented with a pair of English pseudowords and asked to determine whether the two pseudowords rhyme.

Morphological awareness was tested with a 30-item homophonic morpheme test and a 30-item homographic morpheme test. In the homophonic morpheme test, participants were asked to choose one character from four homophones to form a meaningful word with a given character, for example, __段 (线/xian4/ “line,” 献/xian4/ “dedicate,” 羡/xian4/ “envy,” 县/xian4/ “town”). In the homographic morpheme test, participants were presented with a pair of two-character words containing the same morpheme (e.g., 道理/dao4-li3/ “reason” and 理会/li3-hui4/ “pay attention to”) and they were asked to judge whether the morpheme had the same meaning in the two words.

Orthographic awareness was measured with a 60-item character correction test and a 30-item delayed copy test. In the character correction test, participants were asked to identify and correct wrongly written characters. In the delayed copy test, participants were presented with infrequently-used characters for 500 ms and asked to write down the character they had just seen. Raw scores for the phonological awareness, morphological awareness, and orthographic awareness tests were the number of correct items.

Working Memory was tested using forward and backward digit spans. Rapid Automatized Naming (RAN) was tested using digit RAN and picture RAN. In each RAN test, there are 50 items, and the time taken to name all of the 50 items was recorded in seconds and used as the raw score.

fNIRS procedures and stimuli

A passive speech perception task was used to examine Spanish perception with a rapid event-related design. A total of 120 pairs of Spanish consonant-vowel (CV) syllables were used in this task with a consonant and a vowel in each syllable. There were four types of CV syllable pairs: (1) the two Spanish syllables were identical and the sounds had high similarity to those in Chinese (e.g., /pi/-/pi/), (2) the syllables were different but the sounds had high similarity to those in Chinese (e.g., /pi/-/bi/), (3) the syllables were identical but the sounds had low similarity to those in Chinese (e.g., /je/-/je/), and (4) the syllables were different and the sounds had low similarity to those in Chinese (e.g., /je/-/ge/). Participants were asked to listen carefully to the stimuli and to keep their heads as still as possible during the task in order to reduce motion artifacts. There were also 60 baseline trials, for which two black crosses were presented on the screen sequentially. All trials were randomly presented and divided into four runs with around 4 min for each run.

The experimental procedure is displayed in Supplementary Fig. 1. At the very beginning of each run, there was a 5000 ms fixation cross. Each trial began with a brief black fixation cross (200 ms) signaling the onset of a new trial, and then two Spanish CV syllables were presented sequentially in the auditory modality, with a duration of 800 ms for each and a 200 ms blank between the two syllables. The stimulus onset asynchrony (SOA) was jittered between 3.5 and 4 s.

In the speech production task, participants were asked to imitate 26 multi-syllabic Chinese pseudowords and 26 multi-syllabic Spanish words. Each word/pseudoword was repeated 3 times sequentially, resulting in 156 trials in total divided into two runs with 78 trials per run. Chinese pseudowords and Spanish words were presented in random order. Each run began with a silent period (5000 ms) prior to the onset of the first word. For each trial, the audio stimulus was played for 1500 ms with a black fixation cross shown on the screen, followed by a red cross to cue the start of the imitation phase, which lasted for 1500 ms. The inter-trial interval (ITI) was jittered among 250, 500, 750 and 1000 ms. A baseline trial was arranged after the third imitation of each word, during which a cross was presented on the screen for 3000 ms and the participant did not need to do anything. An additional baseline trial was inserted randomly after the first or second imitation phase for each word. Responses were recorded through a microphone connected to the monitor through the E-prime SRBOX. The whole process took approximately 20 min to complete. The experimental procedure is displayed in Supplementary Fig. 1. The mean number of syllables per word/pseudoword was matched in Chinese and Spanish. The order of the speech perception and speech production tasks was counterbalanced across participants.

Phonetic analyses in speech production

To examine whether participants with RD performed worse than individuals without RD in the foreign speech imitation task, we measured the voice onset time (VOT) of two initial stops (i.e. /b/, /d/) and the formant frequencies of 5 vowels (i.e. a, o, i, e, u) in 5 Spanish words in the speech production task (i.e. dificil, dado, brazo, bueno, bebe). VOT and formant extraction were performed in Praat97.

Since formant frequencies are influenced by anatomical/physiological differences (e.g., vocal tract shape and gender)98, a vowel normalization procedure was employed to eliminate the impact of these variables among participants. The vowel formant normalization was performed using the Vowels R package99. We followed the approach in Lobanov100, which is speaker-intrinsic, vowel-extrinsic, and formant-extrinsic, and which performs best at mitigating the effects of speakers’ gender and age-related variations while preserving valuable sociolinguistic information101. In order to quantify the foreign speech imitation performance, we calculated the Euclidean distance between each participant’s vowels and a native Spanish model speaker’s vowels in the F1-F2 vowel space.
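
The normalization and distance steps can be sketched as follows. This is a minimal illustration rather than the Vowels R package itself: Lobanov normalization is taken to be a z-score of each formant computed within a speaker across all of that speaker's vowels, the sample standard deviation is an assumption, and all formant values and function names are made up for the example.

```python
import math
from statistics import mean, stdev
from typing import Dict, Tuple

Formants = Dict[str, Tuple[float, float]]  # vowel -> (F1, F2) in Hz


def lobanov(speaker: Formants) -> Formants:
    """Z-score F1 and F2 separately across all of one speaker's vowels."""
    f1 = [v[0] for v in speaker.values()]
    f2 = [v[1] for v in speaker.values()]
    m1, s1 = mean(f1), stdev(f1)
    m2, s2 = mean(f2), stdev(f2)
    return {k: ((a - m1) / s1, (b - m2) / s2) for k, (a, b) in speaker.items()}


def vowel_distances(learner: Formants, model: Formants) -> Dict[str, float]:
    """Euclidean distance per vowel in the normalized F1-F2 space."""
    ln, mn = lobanov(learner), lobanov(model)
    return {k: math.dist(ln[k], mn[k]) for k in ln}


# Made-up formant values (Hz), for illustration only.
learner = {"a": (850, 1400), "e": (520, 2100), "i": (320, 2500),
           "o": (540, 950), "u": (350, 800)}
model = {"a": (780, 1350), "e": (480, 2000), "i": (300, 2300),
         "o": (500, 900), "u": (320, 750)}
distances = vowel_distances(learner, model)
```

Because each speaker is standardized against their own mean and spread, a uniformly scaled vowel space (as might arise from a shorter vocal tract) maps onto the same normalized positions, so the distance reflects vowel quality rather than anatomy.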

fNIRS data acquisition

Changes in the oxygenated hemoglobin (HbO) and deoxygenated hemoglobin (HbR) concentrations were measured with a continuous-wave (CW) NIRSport2 system (NIRx, Medical Technologies LLC, Berlin, Germany) sampled at 4.4 Hz. Two wavelengths of near-infrared light (760 and 850 nm) were used, with a distance between pairs of source and detector probes set at 3.0 cm. Two 4 × 4 probe sets were placed on the bilateral frontal, parietal, and temporal areas, with each comprising 8 emitter and 8 detector probes, forming 48 channels in total. The international 10–20 system was used to guide and standardize the optode placement, with the D8 and D10 detectors aligned with T7 and T8, respectively (Supplementary Fig. 2).

To determine the anatomical localization of each optode, we collected T1-weighted images from a typical adult participant using a 3.0 Tesla Prisma Siemens scanner with the following parameters: repetition time = 2300 ms; echo time = 3.39 ms; flip angle = 7°; slice thickness = 1 mm; voxel size = 1 × 1 × 1 mm. The images were normalized to MNI coordinate space using SPM12 and brain regions under each optode were determined using the AAL template. The position of a channel was defined as the midpoint between its adjacent emitting and receiving optodes.

fNIRS data pre-processing

Data pre-processing began with a manual visual check on signal quality following Liang et al.102. The spectrograms of all channels were plotted, and NIRS channels without a clear, visible cardiac component (a spike at ∼1–1.5 Hz in the spectrograms) or with only random noise were regarded as low quality. Visual inspections were conducted by two experienced researchers, and a third researcher was invited to resolve inconsistent judgments. If more than 20% of a participant’s channels were low quality, that participant’s data were excluded from further analysis. If the number of low-quality channels did not exceed 20%, these channels were excluded and the remaining channels from this participant were included for further analysis. On average, 3 channels were excluded in each group (children with RD: 2.52; children without RD: 3.39; adults with RD: 3.8; adults without RD: 3.26).
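
The exclusion rule can be made concrete with a short sketch. With 48 channels, a 20% threshold corresponds to 9.6 channels, so a participant with up to 9 low-quality channels is retained and one with 10 or more is excluded. The function name and the 1-based channel numbering are assumptions for illustration.

```python
from typing import List, Optional

N_CHANNELS = 48
MAX_BAD_FRACTION = 0.20


def usable_channels(bad_channels: List[int]) -> Optional[List[int]]:
    """Return retained channel numbers, or None if the participant is excluded."""
    bad = set(bad_channels)
    if len(bad) / N_CHANNELS > MAX_BAD_FRACTION:
        return None  # more than 20% of channels are low quality
    return [ch for ch in range(1, N_CHANNELS + 1) if ch not in bad]
```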

Then we used the Homer2 toolbox103 for further pre-processing. First, the raw fNIRS intensity signals were converted into optical density using the Homer2 hmrIntensity2OD function. Next, wavelet filtering was conducted for motion correction using the hmrMotionCorrectWavelet function (iqr = 0.8), a value recommended by Di Lorenzo et al.104 for analyzing short event-related data. The data were then band-pass filtered between 0.02 Hz and 0.5 Hz to attenuate low-frequency drift and cardiac oscillations. Optical density signals were converted to concentration changes (μmol/L) of HbO and HbR using the modified Beer–Lambert law with a default partial pathlength factor of 6.0 for each wavelength.
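
To illustrate the last conversion step, the sketch below inverts the modified Beer–Lambert law for two wavelengths. Under this law, the optical density change at each wavelength is a weighted sum of the HbO and HbR concentration changes, scaled by the source-detector distance and the partial pathlength factor, so two wavelengths give a 2 × 2 linear system. This is not Homer2 code. The extinction coefficients are placeholders chosen only to mimic the qualitative absorption pattern, and the function names and unit handling are assumptions.

```python
from typing import Dict, Tuple

# PLACEHOLDER extinction coefficients (HbO, HbR) per wavelength in nm.
# They only mimic the qualitative pattern (HbR dominates at 760 nm, HbO at
# 850 nm); substitute tabulated values for any real analysis.
EXTINCTION: Dict[int, Tuple[float, float]] = {760: (0.6, 1.6), 850: (1.1, 0.8)}
DISTANCE_CM = 3.0  # source-detector separation reported in the text
PPF = 6.0          # partial pathlength factor reported in the text


def od_from_conc(d_hbo: float, d_hbr: float, wavelength: int) -> float:
    """Forward model: optical-density change produced by concentration changes."""
    e_hbo, e_hbr = EXTINCTION[wavelength]
    return (e_hbo * d_hbo + e_hbr * d_hbr) * DISTANCE_CM * PPF


def conc_from_od(od_760: float, od_850: float) -> Tuple[float, float]:
    """Invert the 2 x 2 system for the HbO and HbR concentration changes."""
    (a, b), (c, d) = EXTINCTION[760], EXTINCTION[850]
    scale = DISTANCE_CM * PPF
    y1, y2 = od_760 / scale, od_850 / scale
    det = a * d - b * c
    return (d * y1 - b * y2) / det, (a * y2 - c * y1) / det
```

A round trip through `od_from_conc` and `conc_from_od` recovers the original concentration changes, which is a convenient sanity check for any implementation of this step.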

General linear model (GLM) analysis

The preprocessed fNIRS data were imported to the NIRS-KIT toolbox105 based on the MATLAB environment for individual-level analysis. A general linear model (GLM) was used to evaluate channel-wise task-evoked neural activation for each individual participant. Because of the lower signal-to-noise ratio of HbR compared to HbO106, only concentration changes of HbO were investigated in the GLM.

For the speech perception task, five conditions were included in the model (identical Spanish syllables that are similar to Chinese, different Spanish syllables that are similar to Chinese, identical Spanish syllables that are dissimilar to Chinese, different Spanish syllables that are dissimilar to Chinese, and baseline trials). For the speech production task, three conditions were included (Chinese pseudowords, Spanish words, and baseline trials). The model was convolved with the canonical hemodynamic response function, and then model estimation was conducted to calculate how well the model fits with the real brain signal at each channel. The contrast of each lexical condition minus the baseline condition was then defined to estimate signal magnitudes specifically related to each type of stimuli. Finally, beta values from the model estimation were entered for subsequent group-level statistical analysis using ANCOVAs for each channel. Multiple comparison correction was conducted using FDR correction107, since we had 48 channels.
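
The channel-wise correction can be sketched as below. The text specifies only that FDR correction was applied across the 48 channels. The Benjamini-Hochberg step-up procedure shown here is an assumption about the specific method, and the function name and the level q = 0.05 are illustrative.

```python
from typing import List


def benjamini_hochberg(p_values: List[float], q: float = 0.05) -> List[bool]:
    """Return, per test, whether it survives FDR control at level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) whose p-value satisfies p_(k) <= k / m * q.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            cutoff_rank = rank
    # Every test ranked at or below the cutoff is declared significant.
    significant = [False] * m
    for idx in order[:cutoff_rank]:
        significant[idx] = True
    return significant
```

In this setting the function would receive the 48 channel-wise p-values from the ANCOVAs and return one flag per channel.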

Classification based on fNIRS data

The beta values for the contrast of lexical minus baseline for each channel from the general linear model estimation were extracted, resulting in a feature vector of 1 × 48 for each participant in each task.

We used a support vector machine (SVM) to classify RD readers from readers without RD, due to its higher accuracy than other methods for small datasets108,109. We used scikit-learn110, an open-source machine learning library in Python, for the SVM implementation.

First, the feature vector was normalized across participants. In SVM, C is a regularization parameter that determines the trade-off between maximizing the margin and minimizing the classification error while γ is a parameter that influences the shape of the decision boundary. We optimized these two parameters in the radial basis function kernels (RBF-SVM) using a cross-validation grid search among the values of 2^N (N from −5 to 11 for C and from −9 to 13 for γ) in the training dataset, and then optimal parameters were used to test the classifier. We performed a leave-one-out cross-validation (LOOCV) to assess the classifier’s performance until all participants were tested. Last, the number of correct predictions was divided by the total number of participants to calculate the accuracy of cross-validation.
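
The parameter grid and the leave-one-out loop can be sketched in plain Python. To keep the example dependency-free, a nearest-centroid rule stands in for the RBF-SVM, so this block does not reproduce the study's classifier. In scikit-learn, the grids would typically be searched with `GridSearchCV` around `SVC(kernel="rbf")` inside each training fold. All function names below are hypothetical.

```python
from typing import Callable, List, Sequence

# Candidate values 2**N: N from -5 to 11 for C, N from -9 to 13 for gamma.
C_GRID = [2.0 ** n for n in range(-5, 12)]
GAMMA_GRID = [2.0 ** n for n in range(-9, 14)]

Vector = Sequence[float]
Classifier = Callable[[List[Vector], List[int], Vector], int]


def nearest_centroid(train_x: List[Vector], train_y: List[int], x: Vector) -> int:
    """Stand-in classifier (NOT an SVM): predict the label of the closer class mean."""
    best_label, best_dist = -1, float("inf")
    for label in sorted(set(train_y)):
        rows = [r for r, y in zip(train_x, train_y) if y == label]
        centroid = [sum(col) / len(rows) for col in zip(*rows)]
        dist = sum((a - b) ** 2 for a, b in zip(x, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def loocv_accuracy(x: List[Vector], y: List[int], clf: Classifier) -> float:
    """Hold out each participant once; return correct predictions / total."""
    correct = 0
    for i in range(len(x)):
        train_x = x[:i] + x[i + 1:]
        train_y = y[:i] + y[i + 1:]
        correct += int(clf(train_x, train_y, x[i]) == y[i])
    return correct / len(x)
```

The two grids contain 17 and 23 values, so an exhaustive search evaluates 391 (C, γ) combinations per training fold.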

In order to speed up computation and improve performance, we employed recursive feature elimination (RFE) to reduce the impact of irrelevant features in this study. The RFE approach involves a nested LOOCV strategy, wherein the inner LOOCV is performed on the training set of each outer SVM LOOCV fold. The primary objective is to identify an optimal subset of features that contribute most significantly to the classification task. Since we used LOOCV to estimate the generalization ability of the classifier, the optimum feature set was different in the training dataset for each fold of LOOCV. Therefore, when analyzing the contributions of different brain regions, the weights of the features were defined as the frequency of appearing in the optimum feature set across all cross-validation folds. By employing RFE within the LOOCV framework, we were able to optimize the feature selection process and obtain a robust and reliable set of features that consistently contributed to accurate classification across different folds.
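
The frequency-based feature weight described above amounts to counting how often each channel appears in the per-fold optimum feature sets. A minimal sketch, with hypothetical function and variable names and made-up channel selections:

```python
from collections import Counter
from typing import Dict, Iterable, List


def selection_frequency(selected_per_fold: Iterable[List[int]],
                        n_features: int = 48) -> Dict[int, float]:
    """Fraction of cross-validation folds in which each channel was selected."""
    folds = list(selected_per_fold)
    counts = Counter(ch for fold in folds for ch in set(fold))
    return {ch: counts.get(ch, 0) / len(folds) for ch in range(1, n_features + 1)}


# Hypothetical selections from three outer folds (channel numbers are made up).
weights = selection_frequency([[5, 12, 30], [5, 12], [5, 30, 41]])
```

A channel selected in every fold receives a weight of 1, and a channel never selected receives 0.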

We performed a permutation test to evaluate whether the predictive validity of the model was higher than chance. Participants’ feature vectors were shuffled across participants to generate a randomized matrix, and the model was trained and cross-validated as previously described. The data randomization procedure was repeated 1000 times to obtain a null distribution of accuracies. The p-value is the proportion of permutation tests with an accuracy higher than the actual classification accuracy. A significance threshold of 5% (p < 0.05) was employed.
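
The p-value computation can be sketched as follows. The example shuffles the group labels rather than the feature vectors, which breaks the same pairing between features and diagnosis. This substitution is an assumption, as are all names used here. `accuracy_fn` is a hypothetical callable that would retrain and cross-validate the classifier on the shuffled labels and return its accuracy.

```python
import random
from typing import Callable, List, Sequence


def permutation_p_value(
    observed_accuracy: float,
    labels: Sequence[int],
    accuracy_fn: Callable[[List[int]], float],
    n_permutations: int = 1000,
    seed: int = 0,
) -> float:
    """Proportion of shuffled-data accuracies that exceed the observed accuracy."""
    rng = random.Random(seed)
    exceed = 0
    for _ in range(n_permutations):
        shuffled = list(labels)
        rng.shuffle(shuffled)  # breaks the feature-label pairing
        if accuracy_fn(shuffled) > observed_accuracy:
            exceed += 1
    return exceed / n_permutations
```

With 1000 permutations, the smallest non-zero p-value this procedure can return is 0.001.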

In order to find the task with the highest classification accuracy for distinguishing RD readers and readers without RD in children, adults or children and adults combined, we compared the model performance on the speech perception task, the speech production task and a combination of the two tasks.