Introduction

Bronchopulmonary dysplasia (BPD) is a common complication of preterm birth and remains a global problem1,2. BPD is broadly defined as a need for supplemental oxygen support and/or respiratory support around the time that a preterm infant is close to term2. Given the observational nature of this diagnosis, it comes as no surprise that it has offered limited prognostic and mechanistic homogeneity2. In pursuit of a deeper understanding of BPD pathophysiology, twin studies, candidate gene studies, and whole genome association studies have been conducted to identify and validate genetic risk factors for BPD3.

Initial twin studies suggested a substantial genetic influence on BPD prevalence, but recent research contradicts this claim for extremely premature infants3,4. Similarly, candidate gene studies and genome-wide association studies (GWAS) have struggled to identify reproducible relationships between variants and BPD3. However, some notable exceptions exist. First, Hadchouel et al. implicated SPOCK2 in BPD in a discovery series and a replication population; they also demonstrated increased SPOCK2 expression in hyperoxia-exposed rats5. Second, through exome sequencing, Li et al. implicated 258 genes, with significant enrichment in collagen fibril organization, morphogenesis of embryonic epithelium, and regulation of Wnt signaling pathways6. Finally, Blume et al. validated links between genes that regulate immune cell adhesion and BPD7.

While these inconsistent findings may be due to limited sample size and lack of power, we echo the sentiment articulated by Lal et al. that BPD likely comprises distinct sub-phenotypes driven by varied pathophysiological processes and genetic factors8. Furthermore, how genetic factors affect BPD risk is likely highly dependent on the specific conditions during pregnancy, birth, and the postnatal period, implying significant gene-environment interactions.

Overall, the heterogeneity in BPD diagnosis and the varying effects of specific genetic variants influenced by environmental factors may considerably complicate the genetic study of BPD. In particular, the complex web of environmental variables surrounding premature infants may obscure the genetic signal contributing to BPD risk. To address this issue, we look beyond the neonatal intensive care unit context. Considering the anticipated impact of BPD-linked genetic variants on lung growth, inflammation, and injury response, we propose that these variants may also manifest in phenotypes beyond BPD, including asthma. To test this hypothesis, we completed a phenotype-wide association screen in children enrolled in the biobank at the Center for Applied Genomics (CAG), exploring potential links between SNPs previously associated with BPD and other phenotypes.

Results

We found 60 SNPs that were previously associated with bronchopulmonary dysplasia (Supplemental Table I)9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42. Of these, 3 were significantly associated with at least one phenotype in the current study after Bonferroni correction (Table 1). First, the effect alleles of SNP rs3771150 and rs3771171 were found to be significantly associated with higher mean eosinophil percentage (P = 1.06 × 10−5; coefficient 0.03; 95% CI 0.02–0.04 and P = 6.63 × 10−5; coefficient 0.03; 95% CI 0.02–0.04, respectively) amongst European patients (Fig. 1). Second, rs2077079 was significantly associated with higher odds of chronic liver disease in meta-analysis of all ancestries (P = 1.37 × 10−5; OR 1.39; 95% CI 1.20–1.61).
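As a rough consistency check on the reported odds ratio, the Wald statistic implied by an odds ratio and its 95% confidence interval can be recovered on the log scale. The sketch below is illustrative only: the function name is hypothetical, and it assumes the interval is a symmetric Wald interval on the log-odds scale, which the text does not state.

```python
import math

def wald_from_or_ci(odds_ratio, ci_low, ci_high, z_crit=1.959964):
    """Recover log-OR, standard error, z, and two-sided p from an OR and its CI.

    Assumes a Wald interval that is symmetric on the log-odds scale
    (an assumption; the source does not describe how the CI was built).
    """
    beta = math.log(odds_ratio)
    se = (math.log(ci_high) - math.log(ci_low)) / (2.0 * z_crit)
    z = beta / se
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail probability
    return beta, se, z, p

# Values reported for rs2077079 and chronic liver disease (all ancestries).
beta, se, z, p = wald_from_or_ci(1.39, 1.20, 1.61)
print(f"log-OR={beta:.3f}, SE={se:.3f}, z={z:.2f}, p={p:.2e}")
```

Under these assumptions the implied p-value falls on the order of 10−5, in line with the reported value; small differences are expected because the published odds ratio and interval are rounded.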

Table 1 Significant associations revealed by PheWAS by ancestry.
Fig. 1 Manhattan plot with PheWAS results for rs3771150 and rs3771171 in a European cohort. Significance threshold of P = 2.3 × 10−5 is represented by a dotted horizontal line.

Of these associations, those between rs3771150 and rs3771171 and eosinophil percentage were externally validated (Table 2). The European cohort in which this association was found included 3736 (53%) males and 3263 (47%) females. For the 6972 patients with age at time of blood draw available, the mean age at blood draw was 10.0 years with a standard deviation of 4.6 years. Using the Ensembl VEP tool, these variants were categorized as intron variants for IL18R1 and IL18RAP, respectively. Using the OpenTargets interface, the effect alleles of rs3771150 and rs3771171, previously associated with BPD, were found to have been strongly associated with higher “Eosinophil percentage of white cells” (P-value 2.4 × 10−195 and 3.1 × 10−198, respectively) in a study combining UKBiobank with Blood Cell Consortium data43. Furthermore, both are strongly associated with higher odds of asthma in the UKBiobank (P-value 6.0 × 10−26 and 1.7 × 10−26, respectively)44. The AtFKP database implements a Bottom-Line meta-analysis of all datasets in the database to identify significant variant-phenotype associations. This tool again replicated the relationship between rs3771150 and rs3771171 and higher eosinophil percentage (P-value 1.2 × 10−232 and 1.03 × 10−234, respectively) as well as the relationship between both and higher odds of asthma (P-value 1.86 × 10−71 and 2.58 × 10−72, respectively). Beyond these relationships, this tool also identified significant associations with higher odds of chronic obstructive pulmonary disease (COPD) (P-value 6.22 × 10−9 and 2.89 × 10−8, respectively) and lower forced expiratory volume in 1 s over forced vital capacity (FEV1/FVC) (P-value 3.48 × 10−8 and 6.22 × 10−9, respectively). Using the AtFKP Signal Sifter tool, we identified the specific region on chromosome 2 that has variants associated with eosinophil percentage, asthma, COPD, and FEV1/FVC ratio, and we then visualized the Bottom-Line meta-analysis results for these phenotypes using LocusZoom (Fig. 2)45.

Table 2 External validation of associations of interest.
Fig. 2 LocusZoom plots of the region surrounding rs3771150 (2:103060851_G/A) and rs3771171 (2:102985950_T/C). These plots visualize the Bottom-Line meta-analysis results for associations between variants in the region and eosinophil percentage, asthma, COPD, and FEV1/FVC ratio. Lead variants per phenotype are annotated. LD Ref Var = linkage disequilibrium reference variant. Note variation in p-value scale between phenotypes.

After Bonferroni correction, both rs3771150 and rs3771171 remained significantly associated with decreased methylation at probes cg03938978, cg05295703, and cg19795292 and increased methylation at probes cg00652772, cg13315345, cg02648057, cg05168012, and cg08023416 (Table 3). These probes are located on chromosome 2 and relate to both IL18RAP and IL1RL2.

Table 3 Significant association between rs3771150 and rs3771171 and methylation status at select methylation probes.

Both variants are significant protein QTLs (Table 4). In blood plasma, both SNPs are QTLs for IL1RL1, IL18R1, and IL1R246. Furthermore, in HaploReg, rs3771150 and rs3771171 were found to be associated with IL18RAP expression in whole blood47,48.

Table 4 Expression QTL associations for rs3771150 and rs3771171 across tissues.

In the GTEx database, rs3771150 was found to be an expression QTL for AC007278.3 and IL1RL1 in lung tissue. The variant rs3771171 is an expression QTL for AC007278.3 in whole blood and for both AC007278.3 and IL1RL1 in lung tissue. Both rs3771150 and rs3771171 are also splicing QTLs for IL1RL1 in lung tissue (NES 0.33 and 0.34, p-value 3.8 × 10−16 and 1.6 × 10−16, respectively). Since these variants are in high linkage disequilibrium (LD) with an r2 of 0.81 in the HaploReg database, we sought to refine the signal by performing a formal QTL-colocalization analysis for IL1RL1 in lung tissue49. For this, we utilized the CAVIAR high confidence set derived from the GTEx database, leveraging the UCSC Genome Browser50,51. The variant with the highest causal posterior probability (CPP) to affect IL1RL1 expression in lung tissue was rs953934 with a CPP of 0.77. In the GTEx database, rs953934 has an NES of −0.5 with an associated p-value of 4.7 × 10−72 for IL1RL1 in lung tissue.

Discussion

In this study, we present a novel and reproducible genetic association between BPD, eosinophilia, and asthma. These findings are consistent with epidemiological data showing increased asthma incidence in children with BPD, as well as a family history of asthma being a risk factor for asthma diagnosis in children with BPD52. Specifically, our findings highlight the significance of two intronic SNPs, rs3771150 and rs3771171, within the IL18R1 and IL18RAP genes. The effect alleles of these SNPs, initially identified by Floros et al. in an African American cohort, were successfully linked to higher eosinophil percentage in our pediatric cohort, a relationship validated externally32. Moreover, these SNPs were significantly associated with higher odds of asthma in the UKBiobank and with higher odds of asthma and COPD, and lower FEV1/FVC ratio, in the AtFKP database. Both variants are known eQTLs for IL1RL1 in lung tissue, and in our own data these SNPs were connected with methylation patterns previously associated with protein level changes in IL18R1 and IL1RL153,54. In this context, the high LD between both variants should be noted. Both variants are associated with methylation status in our control data, are expression and splicing QTLs in lung tissue, and have pleiotropic effects on eosinophil percentage, asthma, COPD, and FEV1/FVC. While QTL colocalization analysis revealed that rs953934 on chromosome 2 appears to be the most likely variant driving IL1RL1 expression in lung tissue, future investigations using functional assays are required to fully assess whether rs953934 is the primary driver of the observed associations. It is intriguing that these findings implicate a single genomic area in pulmonary diseases across the lifespan. This observation appears to support our hypothesis that variants impacting lung inflammation have the potential to impact various pulmonary phenotypes. The identified region on chromosome 2 contains the IL1RL1 gene. This gene encodes an IL-33 receptor known as ST2. ST2 has two transcriptional variants, ST2L and sST2: the first is a transmembrane receptor that activates the NF-κB pathway, while the second is a soluble protein that acts as a decoy receptor for IL-33 and decreases ST2L-mediated signaling55,56. Aligning with our genetic findings, decreased sST2 levels have been implicated in asthma with type 2 inflammation56,57. These observations have led to ongoing efforts to develop pharmacological interventions targeting IL-33-ST2L signaling for use in asthma58.

Paradoxically, increased as opposed to decreased sST2 expression has been found in patients with asthma and type 2 inflammation56. Similarly, a high as opposed to low sST2 on day of life 14 has been associated with BPD59. Two possible explanations for these observations exist. First, during episodes of acute injury the body may initially rely on a preformed pool of sST2 to prevent runaway inflammation, placing patients with lower baseline sST2 at a disadvantage56. However, this explanation seems less likely, given the presence of a negative feedback loop that significantly induces sST2 secretion60. An alternative explanation that could be explored is that variants within IL1RL1 may shift the balance between sST2 and ST2L, decreasing the efficacy of negative feedback mechanisms. In this context, it is worth noting that both rs3771150 and rs3771171 are sQTLs in lung tissue located between IL1RL1 exon 5 and 661.

We note that we found a significant association with eosinophil percentage only in patients of European ancestry, whereas the initial study implicating these variants in BPD reported an effect limited to their African ancestry cohort. There are several potential explanations for this discrepancy. First, it is possible that this variation could be attributed to random sampling effects. Second, the higher minor allele frequency (MAF) in patients of European ancestry compared to African ancestry (0.28 vs. 0.10 for rs3771150 and 0.29 vs. 0.10 for rs3771171) may have provided us with greater statistical power in the European cohort for the phenotype of interest (higher eosinophil percentage). Finally, variations in LD across populations could contribute to discrepancies in findings among studies.
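The power argument in the second explanation can be made concrete with a short calculation. For an additive model with a fixed sample size and per-allele effect, the information a variant contributes scales approximately with the genotype variance 2p(1 − p), where p is the MAF. The sketch below assumes Hardy-Weinberg equilibrium and compares only this variance term; it ignores differences in cohort size and effect size between ancestries, and the function name is illustrative.

```python
def genotype_variance(maf):
    """Variance of the allele dosage (0, 1, 2) under Hardy-Weinberg equilibrium."""
    return 2.0 * maf * (1.0 - maf)

# MAFs quoted in the text: (European, African) ancestry.
mafs = {"rs3771150": (0.28, 0.10), "rs3771171": (0.29, 0.10)}

ratios = {}
for snp, (eur, afr) in mafs.items():
    ratios[snp] = genotype_variance(eur) / genotype_variance(afr)
    print(f"{snp}: EUR variance={genotype_variance(eur):.4f}, "
          f"AFR variance={genotype_variance(afr):.4f}, ratio={ratios[snp]:.2f}")
```

Under these simplifying assumptions, each European-ancestry participant carries a bit more than twice the information of an African-ancestry participant for these two variants, before any difference in cohort size is considered.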

As outlined in the introduction, the lack of reproducible relationships between genetic variants and BPD is a major challenge facing the field3. By extension, the use of SNPs with only suggestive disease associations is a weakness of the current study until these associations are fully validated. However, validating relationships between genetic variants and BPD is complicated by BPD comprising distinct sub-phenotypes driven by varied pathophysiological processes and genetic sensitivities8. As such, rather than immediately executing a validation study in an inherently heterogeneous BPD phenotype, we designed the current study to inform future validation efforts. Specifically, by assessing whether variants implicated in BPD are associated with non-BPD phenotypes in the absence of premature birth, our aim was to better understand the potential physiologic impact of these variants so as to identify potential BPD sub-populations. In this light, the association between rs3771150 and rs3771171 and higher eosinophil percentage creates a rationale for future studies assessing whether a BPD sub-phenotype with more prominent eosinophilic inflammation can be identified. If this is the case, attempts at validating the effect of rs3771150 and rs3771171 in this sub-phenotype may be a powerful approach. Of note, despite the current lack of direct genomic validation, recent mouse models demonstrating a BPD phenotype in mice with IL-33 knockdown support the potential role of IL1RL1-related variants in BPD62. Another limitation is the use of genetic ancestry as a categorical variable as opposed to using quantitative measures of genetic ancestry. This approach was selected to limit the number of confounders included in the PheWAS analysis and by extension maintain power in this pediatric cohort.

Finally, the previously published link between methylation at cg08023416 and particulate matter exposure may hint at environmental interactions48. Given that our findings support that a subset of patients with BPD is predisposed to develop asthma with significant type 2 inflammation, further characterization of the role of ambient air pollution in long term BPD morbidity and early intervention may be important in optimizing long term outcomes.

Conclusion

In conclusion, our study highlights a genetic link between BPD, eosinophilia, and asthma. While the mechanism behind this relationship remains poorly understood, our findings stress the importance of asthma vigilance and early preventative interventions to optimize lung health. The central role of the IL-33-ST2L pathway in pulmonary health across the lifespan opens avenues for future research and therapeutic exploration, pending validation in subsequent studies.

Materials and methods

Single nucleotide polymorphism selection

SNPs were selected from an enriched literature review starting with SNPs previously summarized by Blume et al. and Solaligue et al.7,63. This list was complemented by reviewing all studies in PubMed identified with MeSH terms ‘Bronchopulmonary Dysplasia’ and ‘Polymorphism, Single Nucleotide’. Any SNP found to be significantly associated with the development of bronchopulmonary dysplasia, either by meeting genome-wide significance in a GWAS or by reaching significance by published p-value in a candidate gene study, was included in the current study.

Population

Subjects were drawn from The Children's Hospital of Philadelphia biorepository at the Center for Applied Genomics (CAG). The pediatric samples included in this biorepository are linked to subjects’ electronic medical records (EMRs). All subjects have consented to both genomic analysis and EMR mining64. This study does not involve human participants and is limited to secondary analysis of existing de-identified data. As such, as defined by 45 CFR 46 from the United States Department of Health and Human Services, this study does not constitute Human Subjects Research. All methods were carried out in accordance with relevant guidelines and regulations.

Genotype imputation

Genotype data were generated on 31 different genotyping arrays, with 96.8% of the data coming from four major families of arrays (HumanHapMap550/610Q, OmniExpress, OMNI2.5M, and the GSA array). Array versions from revisions of these arrays were merged on SNPs present on all arrays and filtered for genotype missingness (geno 0.1), individual missingness (mind 0.02), and minor allele frequency (MAF ≥ 0.01), in that order, using PLINK v1.965. Data were imputed using the TOPMed v2 reference panel on the TOPMed Imputation Server66,67,68. Each imputed file set was filtered for imputation quality on a combination of r-squared (R2) and MAF (SNPs with MAF ≥ 0.05 were kept if R2 ≥ 0.3; SNPs with MAF < 0.05 were kept if R2 ≥ 0.5). File sets were merged, and variants present in 95% of samples were retained. Ancestry was assigned based on the results of principal component analysis (PCA). PCA was performed using flashpca on approximately 2.4 million imputed SNPs with MAF > 0.05 that had been pruned for linkage disequilibrium (LD) using PLINK v1.9, leaving 577,224 variants65,69. The first three principal components were plotted, and patients were assigned to either Asian, African/African American, European, or Hispanic/Latino ancestry by smallest distance to reference genotypes from the HapMap consortium70. No data regarding self-reported race or ethnicity were collected for or used in the current study.
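The two rules described above, the MAF-dependent imputation quality filter and the distance-based ancestry assignment, can be sketched as follows. This is a minimal illustration and not the pipeline used in the study: the function names are hypothetical, the reference coordinates are made-up values, and representing each reference group by a single centroid in the space of the first three principal components is an assumption about how "smallest distance to reference genotypes" was computed.

```python
import math

def passes_imputation_qc(maf, r2):
    """Keep common variants (MAF >= 0.05) at R2 >= 0.3 and rarer ones at R2 >= 0.5."""
    threshold = 0.3 if maf >= 0.05 else 0.5
    return r2 >= threshold

def assign_ancestry(sample_pcs, reference_centroids):
    """Return the reference label whose centroid is nearest in PC space.

    Using one centroid per reference group is an assumption made for
    illustration; the source only states 'smallest distance'.
    """
    return min(
        reference_centroids,
        key=lambda label: math.dist(sample_pcs, reference_centroids[label]),
    )

# Hypothetical centroids for the first three principal components.
reference_centroids = {
    "European": (0.02, 0.03, 0.00),
    "African/African American": (-0.08, 0.01, 0.00),
    "Asian": (0.03, -0.07, 0.01),
    "Hispanic/Latino": (0.01, -0.01, -0.04),
}

print(passes_imputation_qc(maf=0.10, r2=0.35))  # common variant, moderate R2
print(passes_imputation_qc(maf=0.02, r2=0.35))  # rarer variant, same R2
print(assign_ancestry((0.018, 0.028, 0.001), reference_centroids))
```

The stricter R2 requirement for low-frequency variants reflects the lower reliability of imputation for rarer alleles.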

Phenotype wide association study (PheWAS)

International Classification of Diseases 9 (ICD-9) codes were obtained from an anonymized extraction of the CHOP diagnosis database that contained subjects who had been recruited into the patient collection of the CAG. Counts of the occurrence of each ICD-9 code for each subject were generated, and the resulting table was converted into the PheWAS phenotype table. Subjects were included in the case group for each PheWAS phenotype if they possessed two or more occurrences of any of the ICD-9 codes that composed the phenotype in question. Subjects were listed as controls for the PheWAS phenotype if they lacked the case-defining ICD-9 codes, as well as ICD-9 codes corresponding to closely related phenotypes. Phenotypes were analyzed in the PheWAS if they were represented by 20 or more cases in the cohort. Quantitative traits were added to the phenotype table where lab test data were available. For measurements reported multiple times, mean values were calculated and used. Each quantitative trait was examined for normal distribution, and where skewing was in evidence, a log transformation was performed. The subject’s sex and age were included as covariates in the analysis, as were the 10 flashpca-generated principal components and a variable representing the group in which the genotyping array had been imputed. Genotypes were extracted from the imputed data as allele dose information to preserve some information regarding genotype probability, and the allele doses were used as the genotype inputs to the PheWAS. The PheWAS analysis was performed individually on each PCA-defined ancestry, and then a meta-analysis was performed combining all four ancestries. For the association tests, a logistic regression model was used for binary traits and a linear regression model was used for quantitative traits, using a p-value significance threshold of P = 2.3 × 10−5 after Bonferroni correction. To ensure appropriate statistical power with variable sample sizes, number of cases, case–control ratio, and minor allele frequency, we only report sample sizes of 200 cases or more for binary traits and 1000 individuals or more for quantitative traits, as calculated by Verma et al.71.
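The case and control definitions above can be sketched in a few lines. The code below is illustrative rather than the study's implementation: the function names, the example code groupings, and the subjects are hypothetical, and counting occurrences in total across a phenotype's codes (rather than per individual code) is one reading of the case rule. The Bonferroni arithmetic additionally assumes a family-wise alpha of 0.05, which the text does not state.

```python
from collections import Counter

def classify_subject(code_counts, case_codes, related_codes, min_occurrences=2):
    """Return 'case', 'control', or None (excluded) for one phenotype."""
    n_case = sum(code_counts.get(code, 0) for code in case_codes)
    if n_case >= min_occurrences:
        return "case"
    has_related = any(code_counts.get(code, 0) > 0 for code in related_codes)
    if n_case == 0 and not has_related:
        return "control"
    return None  # a single case code, or only a closely related code

def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold for a family-wise error rate alpha."""
    return alpha / n_tests

# Illustrative groupings only; not the actual PheWAS code map.
ASTHMA_CODES = {"493.00", "493.90"}
RELATED_CODES = {"786.07"}

subjects = {
    "s1": Counter({"493.90": 3}),  # repeated asthma code
    "s2": Counter({"493.00": 1}),  # single occurrence only
    "s3": Counter({"786.07": 1}),  # related code only
    "s4": Counter({"382.9": 2}),   # unrelated diagnosis
}
labels = {sid: classify_subject(counts, ASTHMA_CODES, RELATED_CODES)
          for sid, counts in subjects.items()}
print(labels)

# Number of tests implied by the reported threshold if alpha were 0.05.
implied_tests = round(0.05 / 2.3e-5)
print(implied_tests, bonferroni_threshold(0.05, implied_tests))
```

Excluding subjects with a single case code or a related code from both groups keeps ambiguous records out of the control set, at the cost of a smaller sample for that phenotype.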

Validation

The Ensembl Variant Effect Predictor (VEP) was used to assess the likely impact of these variants72. Furthermore, associations of interest were validated in other cohorts by querying the publicly available Open Targets Genetics and Association to Function Knowledge Portal (AtFKP) databases73,74. The functional significance of SNPs of interest was validated by investigating their relationship to previously published methylation data obtained from CAG participants75.

The effect of the variants on gene expression was assessed by querying the GTEx database and refined using the UCSC Genome Browser and the CAVIAR high confidence set50,51 (see acknowledgement section). Specifically, we conducted a colocalization analysis to identify the most likely causal SNP influencing IL1RL1 expression in lung tissue using GTEx v8 data. SNPs within 1 Mb of the IL1RL1 transcription start site (chr2:102,311,502–102,352,037) were analyzed using the CAVIAR software, which calculates the causal posterior probability (CPP) for each SNP. Given our focus on respiratory phenotypes, lung tissue was selected for testing.
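The selection logic described above (restrict to a window around the gene, then rank variants by CPP) can be sketched as below. This does not reproduce CAVIAR itself, which estimates the CPP values from summary statistics and LD. The function names are hypothetical, using the start of the stated interval as the window anchor is an assumption, and all variants other than rs953934, together with their positions and CPP values, are made up for illustration.

```python
WINDOW_BP = 1_000_000
ANCHOR_POS = 102_311_502  # start of the interval given in the text (assumed anchor)

def within_window(position, anchor=ANCHOR_POS, window=WINDOW_BP):
    """True if a variant lies within `window` base pairs of the anchor."""
    return abs(position - anchor) <= window

def top_causal_variant(cpp_by_variant):
    """Return (variant_id, cpp) for the variant with the highest CPP."""
    return max(cpp_by_variant.items(), key=lambda item: item[1])

# Hypothetical positions; only the rs953934 CPP of 0.77 comes from the text.
candidates = {
    "rs953934": {"pos": 102_340_000, "cpp": 0.77},
    "hypothetical_snp_1": {"pos": 102_900_000, "cpp": 0.12},
    "hypothetical_snp_2": {"pos": 104_500_000, "cpp": 0.90},  # outside the window
}

in_window = {vid: rec["cpp"] for vid, rec in candidates.items()
             if within_window(rec["pos"])}
best_id, best_cpp = top_causal_variant(in_window)
print(best_id, best_cpp)
```

Applying the window before ranking matters: a variant with a high CPP for a different signal outside the region would otherwise be selected.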

Software used