Background

Genomic disorders are caused by pathological structural variation in the human genome, usually arising de novo during parental meiosis [1,2,3,4]. The most common pathogenic rearrangements of this kind are copy number variants (CNVs), i.e., deletions or duplications of > 1 kb of genetic material [3, 5, 6]. The clinical phenotypes of genomic disorders are varied. They include congenital dysmorphisms; neurodevelopmental, neurodegenerative, and neuropsychiatric manifestations; and even more common complex phenotypes such as obesity and hypertension [7,8,9,10,11,12]. CNVs have been observed in 10% of sporadic cases of autism [13, 14], 15% of schizophrenia cases [15, 16], and 16% of cases of intellectual disability [17]. These and other associations highlight the importance of structural variation to human health and the need to understand the factors influencing how these variants arise.

There is an intense interest in understanding the mechanisms by which CNVs form [18, 19]. In several regions of the genome, de novo CNVs with approximately the same breakpoints recur in independent meioses (recurrent CNVs) [1, 20]. The presence of segmental duplications flanking these intervals is a hallmark feature of recurrent CNVs. It is hypothesized that misalignment and subsequent recombination between non-allelic low copy repeat (LCR) segments within the segmental duplication regions is the formative event giving rise to the CNV [21, 22], so-called non-allelic homologous recombination (NAHR). Risk factors that may favor NAHR have been investigated and include sequence composition and orientation of the LCRs themselves [21, 23] as well as the presence of inversions at the locus [24, 25].

Parental sex bias for the origin of recurrent de novo CNVs remains unexplained. De novo deletions at the 16p11.2 and 17q11.2 loci are more likely to arise on maternally inherited chromosomes [26,27,28,29]. Deletions at the 22q11.2 locus show a slight maternal bias as well [30]. In contrast, deletions at the 5q35.3 locus (Sotos syndrome [MIM: 117550]) display a paternal origin bias [31, 32]. Deletions at the 7q11.23 locus (Williams syndrome [MIM: 194050]) do not show a bias in parental origin [24]. While it has been suggested that sex-specific recombination rates might influence sex biases in NAHR [26], this hypothesis has not been formally tested.

The majority of recurrent CNVs are thought to form during meiosis when homologous chromosomes align and synapse during prophase I [33]. It is well established that meiosis differs significantly between males and females. In males, spermatogonia continuously divide and complete meiosis throughout postpubescent life, with all four products of meiosis resulting in gametes. In contrast, in human females, oogonia are established in fetal life and enter an extended period of stasis in prophase I of meiosis until they complete meiosis upon ovulation and fertilization [34]. Additionally, in female meiosis, only one of the four products of meiosis results in a gamete. Sexual dimorphism in meiosis extends to the patterns and processes of recombination during meiosis [33]. Here we ask whether local sex-specific rates of meiotic recombination can predict the parental bias for the origin of recurrent de novo CNVs.

Methods

Parent of origin determination

Literature search and parental origin data curation

For this analysis, we considered the 55 known genomic disorder CNV loci described in Coe et al. [7]. A locus was eligible for inclusion in the current analysis if it was flanked by LCRs (i.e., mediated by NAHR) and not imprinted (n = 38 eligible loci). For each of these 38 loci, we performed a systematic PubMed search to identify published data on parental origin. Studies were included in our analysis when the following criteria were met: (1) the study detailed parent of origin data for one of the 38 eligible NAHR-mediated loci as designated by Coe et al. [7], (2) the authors of the study interrogated the entire canonical CNV interval to confirm the presence of a deletion or duplication in the patients, (3) the authors determined the investigated CNVs were de novo, and (4) the study clearly treated monozygotic twins as one meiotic event and not two (Additional file 1: Supplemental Methods, Additional file 2: Table S1, and Additional file 3: Table S2). The literature search led to a manual review of 1268 papers, out of which we identified 77 manuscripts across 24 loci with suitable data for analysis: 1q21.1 [35,36,37,38,39], 1q21.1 TAR [40], 2q13 [37], 3q29 [37,38,39,40,41,42], 5q35 [31, 32], 7q11.23 [24, 40, 43,44,45,46,47,48,49,50,51,52,53,54], 8p23.1 [55, 56], 11q13.2q13.4 [57], 15q13.3 [38, 40, 58], 15q24 (AC, AD, BD, and BE intervals) [59,60,61,62,63,64], 15q25.2 [65,66,67], 16p11.2 [26, 37, 40, 68,69,70], distal 16p11.2 [37, 38, 70], 16p11.2p12.1 [71], 16p13.11 [37], 17p11.2 [72,73,74,75,76], 17q11.2 [28, 29, 77], 17q12 [37, 38, 78], 17q21.31 [19, 25, 67, 79,80,81,82,83,84], 17q23.1q23.2 [69, 85], and 22q11.2 [30, 43, 53, 86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102] (Table 1). For the remaining 14 loci, no published parent of origin data could be identified. At the 3q29 locus, we generated new data to determine the parent of origin for de novo events (http://genome.emory.edu/3q29/).

Table 1 Summary of CNV loci included in literature search and curated studies

Determination of parental origin for 3q29 deletion

Study subject recruitment

This study was approved by Emory University’s Institutional Review Board (IRB00064133). Individuals with a clinically confirmed diagnosis of 3q29 deletion were ascertained through the internet-based 3q29 registry (https://3q29deletion.patientcrossroads.org/) as previously described [103]. Blood samples were obtained from 14 families. SNP genotyping was performed on 12 of the 14 families (10 full trios, 2 mother–child pairs) using the Illumina GSA-24 v 3.0 array. For 2 full trios (6 samples), parent of origin was determined from whole-genome sequence data on Illumina's NovaSeq 6000 platform. Quality control was performed with PLINK 1.9 [104] and our custom pipeline (Additional file 1: Supplemental Methods).

Parental origin analysis

Parental origin of the 3q29 deletion was determined for all 14 families using PLINK 1.9 [104]. SNPs located within the 3q29 deletion region (chr3:196029182–197617792; hg38) were isolated for analysis, and the pattern of Mendelian errors (MEs) was analyzed. The parent with the most MEs was considered the parent of origin for the 3q29 deletion (Additional file 1: Supplemental Methods). The mean age of fathers in our 3q29 cohort was calculated from self-reported data collected in conjunction with the Emory University 3q29 project (http://genome.emory.edu/3q29/) and compared to the U.S. average (NCHS; https://www.cdc.gov/nchs/index.htm) via a two-tailed two-sample t-test using R [105].
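The Mendelian-error rule described above can be illustrated with a minimal sketch. It is not the PLINK 1.9 pipeline used in the study; it assumes a simplified model in which the proband's single remaining allele inside the deletion is reported as an apparent homozygote, genotype calls are complete two-letter strings, and the function names (`count_mendelian_errors`, `parent_of_origin`) are hypothetical.

```python
from typing import Dict, List, Tuple

# One SNP = (child, father, mother) genotype calls, e.g. ("AA", "GG", "AG").
# Assumption: calls are complete two-letter strings with no missing data.
TrioCall = Tuple[str, str, str]


def count_mendelian_errors(snps: List[TrioCall]) -> Dict[str, int]:
    """Count, per parent, the SNPs at which the child's allele is absent."""
    errors = {"father": 0, "mother": 0}
    for child, father, mother in snps:
        # Inside a hemizygous deletion the child carries one allele, which a
        # genotyping array reports as an apparent homozygote.
        if len(set(child)) != 1:
            continue  # heterozygous call: uninformative in this simplified model
        allele = child[0]
        if allele not in father:
            errors["father"] += 1  # father cannot have transmitted this allele
        if allele not in mother:
            errors["mother"] += 1  # mother cannot have transmitted this allele
    return errors


def parent_of_origin(snps: List[TrioCall]) -> str:
    """Assign the deletion to the parent with the most Mendelian errors."""
    errors = count_mendelian_errors(snps)
    if errors["father"] == errors["mother"]:
        return "undetermined"
    return "paternal" if errors["father"] > errors["mother"] else "maternal"


# Hypothetical genotypes for four SNPs inside the deleted interval.
example = [("AA", "GG", "AG"), ("CC", "TT", "CC"), ("GG", "AG", "GG"), ("AG", "AG", "AA")]
print(count_mendelian_errors(example), parent_of_origin(example))
```

In this toy trio the child's alleles are incompatible with the father at two SNPs and with the mother at none, so the deletion is assigned to the paternal chromosome.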

Calculation of recombination rates and ratios

Male and female recombination rates (cM/Mb) were obtained from the deCODE sex-specific maps, which are based on over 4.5 million crossover recombination events from 126,427 meioses, with an average resolution of 682 base pairs [106]. The recombination rate (cM/Mb) data from deCODE are publicly available as recombination rates calculated for physical genomic intervals, each bounded by two SNP markers (Additional file 1: Supplemental Methods). Therefore, for our calculation of the average male and female recombination rates, each bounded recombination rate was weighted by the total number of base pairs contained within the respective SNP marker interval. Weighted rates were then averaged across the CNV interval for males and females separately. The ratio of the weighted average male and female recombination rates was then calculated for each CNV interval by dividing the weighted average male recombination rate by the weighted average female recombination rate (Additional file 1: Figure S1). To account for slight differences in the recombination rate ratios calculated for the different LCR22 intervals at the 22q11.2 locus, we used an adjusted recombination rate ratio composed of the weighted recombination rate ratios calculated for each LCR22 interval. Weights were based on the estimated population prevalence of the different 22q11.2 deletion intervals (Additional file 1: Table S3) [107].
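The length-weighted averaging described above can be sketched as follows. This is an illustration rather than the authors' pipeline: the interval tuples, the function names, and the use of a simple prevalence-weighted arithmetic mean for the 22q11.2 adjustment are assumptions, and the numbers are invented.

```python
from typing import List, Sequence, Tuple

# (start_bp, end_bp, rate_cM_per_Mb) for one SNP-bounded interval.
Interval = Tuple[int, int, float]


def weighted_mean_rate(intervals: List[Interval]) -> float:
    """Average recombination rate, weighting each interval by its length in bp."""
    total_bp = sum(end - start for start, end, _ in intervals)
    if total_bp <= 0:
        raise ValueError("intervals must cover at least one base pair")
    return sum((end - start) * rate for start, end, rate in intervals) / total_bp


def male_to_female_ratio(male: List[Interval], female: List[Interval]) -> float:
    """Ratio of the weighted average male rate to the weighted average female rate."""
    return weighted_mean_rate(male) / weighted_mean_rate(female)


def prevalence_weighted_ratio(ratios: Sequence[float], prevalences: Sequence[float]) -> float:
    """Combine per-interval ratios using prevalence weights (assumed arithmetic mean)."""
    total = sum(prevalences)
    return sum(r * w for r, w in zip(ratios, prevalences)) / total


# Invented example: two male intervals and one female interval over the same 4 kb.
male = [(0, 1000, 2.0), (1000, 4000, 1.0)]
female = [(0, 4000, 2.5)]
print(weighted_mean_rate(male), male_to_female_ratio(male, female))
```

Weighting by interval length keeps a short interval with an extreme rate from dominating the average over a CNV interval that spans hundreds of kilobases.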

Logistic regression analysis

Parental origin data were curated for CNVs at the 24 CNV loci from 77 independent studies; only independent samples were included in the analysis (duplicate or overlapping samples were removed). For each CNV locus, the male to female recombination rate ratio was calculated as described above. A logistic regression model was fitted to the data with the natural-log-transformed male to female recombination rate ratio as the predictor and parental origin (paternal vs. maternal) as the response variable. We performed a secondary analysis stratified by deletions and duplications. See Table 2 and Additional file 4: Table S4 for the data calculated and used in the logistic regression analyses.
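A single-predictor logistic model of this form can be fitted with a few Newton-Raphson steps. The sketch below is a generic illustration on grouped counts, not the software used in the study; the function name `fit_logistic` and the example counts are hypothetical.

```python
import math
from typing import Sequence, Tuple


def fit_logistic(log_ratios: Sequence[float],
                 paternal: Sequence[int],
                 maternal: Sequence[int],
                 max_iter: int = 50) -> Tuple[float, float]:
    """Fit logit(P(paternal)) = a + b * x by Newton-Raphson on grouped counts."""
    a, b = 0.0, 0.0
    for _ in range(max_iter):
        g_a = g_b = h_aa = h_ab = h_bb = 0.0
        for x, pat, mat in zip(log_ratios, paternal, maternal):
            n = pat + mat
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            resid = pat - n * p          # observed minus expected paternal count
            w = n * p * (1.0 - p)        # binomial variance weight
            g_a += resid
            g_b += resid * x
            h_aa += w
            h_ab += w * x
            h_bb += w * x * x
        det = h_aa * h_bb - h_ab * h_ab
        step_a = (h_bb * g_a - h_ab * g_b) / det
        step_b = (h_aa * g_b - h_ab * g_a) / det
        a += step_a
        b += step_b
        if abs(step_a) + abs(step_b) < 1e-12:
            break
    return a, b


# Hypothetical loci: recombination rate ratios of 1 and 2, with invented counts.
x = [math.log(1.0), math.log(2.0)]
intercept, slope = fit_logistic(x, paternal=[50, 80], maternal=[50, 20])
print(intercept, slope, math.exp(slope))  # exp(slope) is the odds ratio per unit of ln(ratio)
```

Because the predictor is on the natural-log scale, the fitted slope is the change in the log-odds of paternal origin per unit increase in ln(male/female ratio).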

Table 2 Summary of genomic disorder loci CNVs recombination calculations

Linear regression analysis

For linear regression, locus-specific estimates for parental origin were derived by combining the data from all published studies for a given locus. To reduce the uncertainty introduced by small sample sizes, only those loci with more than 10 observations were included. The natural-log-transformed ratio of combined paternal to maternal origin counts for each locus was regressed on the natural-log-transformed average male to female recombination rate ratio for that locus’ CNV interval. Each locus was weighted based on its sample size.
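The weighted regression described above can be sketched with closed-form weighted least squares. This is an illustration under stated assumptions: the helper names are hypothetical, the counts are invented, and loci with a zero paternal or maternal count are skipped here because their log ratio is undefined (the source does not say how such loci were handled).

```python
import math
from typing import List, Sequence, Tuple


def locus_points(loci: Sequence[Tuple[float, int, int]],
                 min_observations: int = 11) -> Tuple[List[float], List[float], List[float]]:
    """Turn (rate_ratio, paternal, maternal) records into x, y, and weights."""
    xs, ys, ws = [], [], []
    for ratio, pat, mat in loci:
        n = pat + mat
        if n < min_observations or pat == 0 or mat == 0:
            continue  # too few observations, or log count ratio undefined
        xs.append(math.log(ratio))
        ys.append(math.log(pat / mat))
        ws.append(float(n))  # each locus weighted by its sample size
    return xs, ys, ws


def weighted_linear_fit(x: Sequence[float], y: Sequence[float],
                        w: Sequence[float]) -> Tuple[float, float, float]:
    """Return slope, intercept, and weighted R^2 of y regressed on x."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    syy = sum(wi * (yi - my) ** 2 for wi, yi in zip(w, y))
    slope = sxy / sxx
    return slope, my - slope * mx, (sxy * sxy) / (sxx * syy)


# Invented loci; the last one has only 5 observations and is excluded.
loci = [(0.5, 10, 40), (1.0, 20, 20), (2.0, 40, 10), (4.0, 3, 2)]
print(weighted_linear_fit(*locus_points(loci)))
```

The weighted R^2 returned here is the quantity that corresponds to the proportion of variance in parental bias explained by the recombination rate ratio.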

Results

Recurrent genomic disorder loci literature search

We conducted a systematic literature search for the 38 non-imprinted and NAHR-mediated CNV loci in Coe et al. [7] (Table 1, Additional file 2: Table S1). We identified parent-of-origin studies that met the inclusion criteria stated in the “Methods” section. In total, 77 studies met the inclusion criteria; from these 77 studies, data were curated for 24 loci, including copy number variants at 1q21.1 [35,36,37,38,39], 1q21.1 TAR [40], 2q13 [37], 3q29 [37,38,39,40,41,42], 5q35 [31, 32], 7q11.23 [24, 40, 43,44,45,46,47,48,49,50,51,52,53,54], 8p23.1 [55, 56], 11q13.2q13.4 [57], 15q13.3 [38, 40, 58], 15q24 (AC, AD, BD, and BE intervals) [59,60,61,62,63,64], 15q25.2 [65,66,67], 16p11.2 [26, 37, 40, 68,69,70], distal 16p11.2 [37, 38, 70], 16p11.2p12.1 [71], 16p13.11 [37], 17p11.2 [72,73,74,75,76], 17q11.2 [28, 29, 77], 17q12 [37, 38, 78], 17q21.31 [19, 25, 67, 79,80,81,82,83,84], 17q23.1q23.2 [69, 85], and 22q11.2 [30, 43, 53, 86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102] (Table 2). Each locus has between one and twenty independent studies, representing in total 1977 de novo deletion (N = 1913) and duplication (N = 64) events (Table 2).

Parent of origin of 3q29 deletion

We determined parent of origin in 12 full trios in which the proband had a de novo 3q29 deletion; in 2 additional families where only proband and maternal DNA samples were available, parent of origin was inferred. For the 12 families evaluated by SNP arrays, in all cases, the number of Mendelian errors for the presumed inherited (intact) parental allele was zero, and the mean number of Mendelian errors for the presumed de novo parent-of-origin allele was 41, with a range of 27–66. For the two trios evaluated with sequence data, Mendelian errors were 20–33-fold higher for the de novo parent of origin than for the parent contributing the inherited allele. In these 14 families, 13 deletions (92.9%) arose on the paternal genome, indicating a significant departure from the null expectation of 50% (p = 0.002, binomial exact). When accounting for only full trios, 11 of 12 (91.7%) deletions arose on paternal haplotypes (p = 0.006, binomial exact), altogether indicating a paternal bias for the origin of the 3q29 deletion (Additional file 1: Table S5). We examined the age distribution of male parents in our cohort; the mean age is 34 years (median = 34 years) and is not significantly different from the 2018 U.S. national average (31.8 years) (p = 0.08, two-tailed two-sample t-test). These data indicate the bias in the 3q29 sample is unlikely to be due to oversampling of older fathers (Additional file 1: Table S5).
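The binomial exact tests reported above can be reproduced with a short sketch. It assumes the common two-sided definition that sums the probabilities of all outcomes no more likely than the observed one; the function name `binomial_exact_p` is hypothetical, and the source does not state which software computed these p-values.

```python
from math import comb


def binomial_exact_p(successes: int, trials: int, p_null: float = 0.5) -> float:
    """Two-sided exact binomial p-value against a null proportion."""
    def pmf(k: int) -> float:
        return comb(trials, k) * p_null ** k * (1.0 - p_null) ** (trials - k)

    observed = pmf(successes)
    # Sum every outcome at most as probable as the observed one; the small
    # tolerance guards against floating-point ties.
    total = sum(pmf(k) for k in range(trials + 1) if pmf(k) <= observed * (1 + 1e-7))
    return min(1.0, total)


print(round(binomial_exact_p(13, 14), 3))  # 13 paternal of 14 families
print(round(binomial_exact_p(11, 12), 3))  # 11 paternal of 12 full trios
```

Under a null proportion of 50%, these calls give approximately 0.002 and 0.006, matching the values reported for the 14 families and the 12 full trios.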

Meiotic recombination and parental origin

We tested the hypothesis that sex-dependent differences in meiotic recombination could explain the parental biases observed for recurrent genomic disorder loci mediated by NAHR. We determined the paternal and maternal origin counts of the CNVs curated from the literature search. Of the 1977 CNVs, 870 were paternal in origin and 1107 were maternal in origin. We calculated the average male and female recombination rates (cM/Mb) across the CNV intervals at all 24 loci using recombination rates published by the deCODE genetics group [106] (Additional file 1: Figure S2–S12). We fit a simple logistic model to the data, with the male-to-female recombination rate ratio as the predictor and parental origin as the response variable (Table 2; Additional file 4: Table S4). Our data reveal that the sex-dependent recombination rate ratio significantly predicts parental de novo origin of a given CNV (p = 1.07 × 10^-14, β = 0.6606, CI95% = (0.4980, 0.8333), OR = 1.936) (Fig. 1). In other words, for a given region, the higher the male recombination rate is relative to the female rate, the more likely a CNV formed in that region will be paternal in origin. Stratified analyses of deletions and duplications separately yielded nearly identical models (Deletions: p = 8.88 × 10^-14, β = 0.6721, CI95% = (0.5009, 0.8546), OR = 1.9584; Duplications: p = 0.02, β = 0.8304, CI95% = (0.1508, 1.6017), OR = 2.2942) (Additional file 1: Figure S13–S14, Table S6–S7). Simple linear regression on the subset of CNV loci with more than 10 samples shows a striking correlation between relative recombination rates and parental origin, where relative recombination rates explain 85% of the variance in parental bias (Additional file 1: Figure S15 and Table S8). Our logistic model can be used to predict paternal origin rates for any locus with estimable recombination in males and females, and we have done so (Additional file 1: Table S9). CNVs at the 15q13.3 and 17q23 loci are both predicted to have a paternal origin approximately 60% of the time, while CNVs at the distal 16p11.2 locus are predicted to have a maternal origin 76% of the time (Additional file 1: Table S9). If the model is correct, these loci should exhibit a bias in parental origin.
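The relationship between the reported coefficient, the odds ratio, and a predicted paternal-origin probability can be made concrete with a small sketch. It assumes the predictor is the natural log of the male to female recombination rate ratio, as described in the Methods. The model intercept is not reported in this section, so it is left as a required argument, and any value passed for it here is purely illustrative.

```python
import math


def odds_ratio(beta: float) -> float:
    """Odds ratio per unit increase in ln(male/female recombination rate ratio)."""
    return math.exp(beta)


def predicted_paternal_probability(rate_ratio: float, beta: float, intercept: float) -> float:
    """Predicted probability of paternal origin from a fitted logistic model.

    The intercept must be supplied by the caller; it is not given in the text.
    """
    if rate_ratio <= 0:
        raise ValueError("rate_ratio must be positive")
    log_odds = intercept + beta * math.log(rate_ratio)
    return 1.0 / (1.0 + math.exp(-log_odds))


BETA = 0.6606  # reported coefficient for all CNVs combined
print(round(odds_ratio(BETA), 3))  # 1.936, the reported OR

# Illustrative only: a placeholder intercept of 0.0, NOT the fitted value.
for ratio in (0.5, 1.0, 2.0):
    print(ratio, round(predicted_paternal_probability(ratio, BETA, intercept=0.0), 3))
```

Exponentiating the reported β recovers the reported odds ratio, and the same transformation applied to the confidence limits of β gives the confidence interval for the odds ratio.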

Fig. 1

Recombination rates associate with parental origin. Predicted (curve) and observed paternal origin proportions for 1977 CNVs from 24 loci. Curated parent of origin data from 77 published studies are collapsed by locus into single data points; recombination rate ratios are the average of the metric for all CNVs within the data point. Data point size and color correspond to the number of CNVs collapsed into the data point. Recombination rate ratios predict parental origin for CNVs mediated by NAHR (p = 1.07 × 10^-14, β = 0.6606, CI95% = (0.4980, 0.8333), OR = 1.936).

Discussion

Parent of origin bias for de novo events at recurrent CNV loci has been well documented but has lacked a compelling explanation. Our analysis of data gathered on 1977 CNVs from 77 published reports demonstrates that sex-specific variation in local meiotic recombination rates predicts parent of origin at recurrent CNV loci. Human male and female meiotic recombination rates and patterns differ greatly across the broad scale of human chromosomes. Recombination events are nearly uniformly distributed across the chromosome arms in females but tend to be clustered closer to the telomeres in males [108]. We note that this pattern has been previously recognized [26]. Here we have formally tested the hypothesis that recombination variation drives parent of origin variation using a rigorous statistical framework (Fig. 1) and provided an estimate for the variance in parent of origin bias that is due to sex-specific recombination rates (Additional file 1: Figure S15).

Investigations into the mechanism by which recurrent CNVs arise have focused on LCRs and their makeup [1, 109]. These regions are composed of units of sequence repeats that vary in orientation, percent homology, length, and copy number. Consequently, LCRs are mosaics of varying units, imparting complexity to LCR architecture [23]. The frequency of NAHR events mediated by LCRs is a function of these characteristics and other features of the genomic architecture [21]. Specifically, the rate of NAHR is known to correlate positively with LCR length and percent homology and decrease as the distance between LCRs increases [19, 21]. However, because LCRs are challenging to study with short-read sequencing technology, the population-level variability of these regions is not well described [110]. Recent breakthroughs with long-read sequencing and optical mapping have revealed remarkable variation in LCRs [111,112,113], and haplotypes with higher risks for CNV formation have now been identified [114]. LCRs are substrates for NAHR [1], and thus are subject to the recombination process. Local recombination rates may influence how likely an NAHR event will happen between two LCRs. Therefore, when analyzing LCR haplotypes and their susceptibility to NAHR, one would need to take into account sex differences in recombination. For example, at loci with maternal biases, specific risk haplotypes may be required for males to form CNVs and vice versa. Greater enrichment of GC content, homologous core duplicons or the PRDM9 motifs, or other recombination-favoring factors may also be required [1, 19].

Variation in recombination rates between sexes is well established [108, 115,116,117,118]. Prediction of individual risk may also need to consider individual variation in meiotic recombination, particularly due to heritable variation and the presence or absence of inversion polymorphisms [117, 119]. Variants in several genes, including PRDM9, have been shown to affect recombination rates and the distribution of double-stranded breaks in mammals [120, 121]. Common alleles in PRDM9 have been reported to affect the percentage of recombination events within individuals that take place at hotspots [120], and variants in RNF212 are associated with opposite effects on recombination rate in males and females [116, 121]. The unexplained variance in our study may be due to these additional factors, which are rich substrates for future study.

Many human genetic studies have observed correlations between inversion polymorphisms and genomic disorder loci [25, 122]. Because these inversions are copy-number neutral and often located in complex repeat regions [123], they can be difficult to assay with current high-throughput strategies, and their true impact remains to be explored. One model proposes that during meiosis these regions may fail to synapse properly and increase the probability of NAHR [124, 125]. Another theory suggests that the formation of inversions increases directly oriented content in LCRs, leading to an NAHR-favorable haplotype [126]. Supporting these theories, inversion polymorphisms have been identified at the majority of recurrent CNV loci [24, 25, 30, 122, 124, 126, 127]. At the 7q11.23, 17q21.31, and 5q35 loci [24, 25, 127], compelling data implicate inversions as a highly associated marker of CNV formation. However, heterozygous inversions are known to suppress recombination, perturbing the local pattern of recombination and altering the fate of chiasmata [119]. The analysis presented here strongly suggests that recombination is the driving force for CNV formation, giving rise to an alternate explanation for the association between inversions and CNVs: they are both the consequence (and neither one the cause) of recombination between non-allelic homologous LCRs. Inversions and CNVs appear to be associated because both are initiated by aberrant recombination. Viewing the system in this manner also explains the frequency of individual inversions at CNV loci. Inversions arise via rare aberrant recombination, like CNVs, but are subsequently driven to higher frequency by natural selection, because they act to suppress recombination and “save offspring” from deleterious genomic disorders. Of course, frequent mutations leading to inversions and the details of LCR structure, such as relative orientation and homology within a genomic region, may promote or impede CNV formation in a locus-specific manner [128,129,130]. Further exploration of this relationship with improved genomic mapping can test these alternative models [131]. One testable prediction of the model described here is that inversions should be at higher frequency at loci giving rise to highly deleterious CNVs, as opposed to loci harboring recurrent benign CNVs.

To our knowledge, this study is the first comprehensive investigation of parental origin of recurrent, NAHR-mediated CNV loci. Investigations of predominantly nonrecurrent CNVs show paternal bias [132,133,134]. Unlike recurrent CNVs, nonrecurrent CNVs are mostly formed via non-homologous end joining (NHEJ) and replicative mechanisms [1, 135, 136]. The standing hypothesis is that replication-based mechanisms of nonrecurrent CNV formation, which are known to accumulate errors in male germlines, contribute to this bias [132]. Our study reinforces the idea that the factors influencing recurrent CNVs differ from those impacting nonrecurrent CNVs. Future genome-wide analyses with larger sample sizes can further help refine our understanding of the divergent forces at play affecting recurrent and nonrecurrent CNV formation.

We conducted a comprehensive literature search at 38 loci and ultimately identified 1977 samples for analysis. We note that the majority of the data come from 7 well-studied loci (Table 1). While we curated the data thoroughly and systematically, it is possible that our data are subject to publication bias, where loci that exhibit parent of origin biases are more likely to have parental origin reported. Further exacerbating potential publication bias, genetic testing for the affected patient (and even more so for the parents) can be difficult to obtain due to concerns about insurance coverage, potential future discrimination, and privacy [137,138,139,140]. However, we note that individuals with CNVs are generally not ascertained or recruited under the expectation that recombination affects parent of origin, and therefore, any potential publication or ascertainment bias is unlikely to confound the results of our analysis. Analysis of a larger cohort of CNV loci, including benign CNVs, will give greater insight into how recombination, and sex differences in recombination, influence parent of origin in CNVs.

Our estimates of recombination rates summarize CNV-scale (broad-scale) patterns of recombination, rather than fine-scale patterns near the LCRs, the sites of the recombination events that form these CNVs. For example, local sex-specific hotspots within LCRs could be the underlying drivers behind the correlation between recombination rates and parental origin. Given the nature of repetitive regions like LCRs and our inability to adequately interrogate them with current sequencing technologies, accurate recombination data across and within the LCR regions are not available. In other words, the data are currently insufficient to conclude whether or not these broad-scale patterns are tightly correlated with fine-scale recombination rates in and around the LCRs. The best available data in the field allow us to infer the following: broad-scale patterns of recombination tightly predict patterns of parental origin.

Conclusions

In this study, we determined that male and female differences in meiotic recombination rates significantly predict parent of origin for recurrent CNV loci. Combining the sex-specific recombination landscape and the mechanistic factors underlying it with a more detailed understanding of existing structural factors at genomic disorder loci can be expected to help guide standards used to identify and perform genetic counseling for individuals at risk of genomic rearrangement.