Abstract
Aberrations in the capacity of DNA/chromatin modifiers and transcription factors to bind non-coding regions can lead to changes in gene regulation and impact disease phenotypes. However, identifying distal regulatory elements and connecting them with their target genes remains challenging. Here, we present MethNet, a pipeline that integrates large-scale DNA methylation and gene expression data across multiple cancers, to uncover cis regulatory elements (CREs) in a 1 Mb region around every promoter in the genome. MethNet identifies clusters of highly ranked CREs, referred to as ‘hubs’, which contribute to the regulation of multiple genes and significantly affect patient survival. Promoter-capture Hi-C confirmed that highly ranked associations involve physical interactions between CREs and their gene targets, and CRISPR interference based single-cell RNA Perturb-seq validated the functional impact of CREs. Thus, MethNet-identified CREs represent a valuable resource for unraveling complex mechanisms underlying gene expression, and for prioritizing the verification of predicted non-coding disease hotspots.
Introduction
Both coding and noncoding elements can drive cancer and its resistance to therapy, but coding regions, which make up a mere 2% of our genome, are typically the focus of analysis. This is because of the high cost of whole genome sequencing and the fact that coding sequences are easily identifiable and can be directly linked to changes in gene expression1,2. In contrast, it is difficult to connect cis-regulatory elements (CREs) in non-coding regions to their target genes as these can be located many hundreds of kilobases away on the linear chromosome. Nonetheless, noncoding regulatory elements like promoters, enhancers, silencers and structural elements cannot be ignored since noncoding variants have been shown to be more likely to contribute to disease susceptibility than non-synonymous coding variants. Furthermore, noncoding regulatory elements occupy a greater proportion of the genome compared to coding sequences, and they alter the binding capability of factors that are the key drivers of gene regulation. In addition, and equally important, epigenetic changes that alter the ability of a transcription factor (TF) to bind a regulatory element can have the same effect3,4,5.
CREs are marked by active histone marks and DNA hypomethylation and are enriched for the binding of transcription factors. Distally located CREs rely on cohesin-mediated loop formation to bring them into physical contact with the promoters of genes they control6. The chromatin contacts can be stable (found in all cell types), or cell-type specific with contacts mediated by cell-type specific TFs, in a CTCF dependent or independent manner. Although gene regulation commonly occurs between a single CRE-promoter pair, transcriptional control can be complicated by CRE redundancy. Indeed, we and others have shown that enhancers can control the expression of more than one gene7,8, and similarly, promoters can act as enhancers that regulate other distally located target genes9. Thus, gene regulation can occur in ‘hubs’, which encompass multiple CREs and the promoters of their gene targets. CRE hubs connect regulatory elements that can be widely separated on the linear chromosome, but with interactions largely restricted to the same topologically associated domain (TAD), suggesting they rely on a loop extrusion mechanism for their formation10. Hubs are strongly enriched for promoters of target genes and super-enhancers critical for cell identity and are associated with high transcriptional activity indicating a probable important role in gene regulatory networks that control cell fate11,12,13,14.
Understanding the mechanisms by which genes are controlled is challenging for a number of reasons. First, as mentioned above, regulatory elements can be located hundreds of kilobases away on the linear chromosome and they do not necessarily control the nearest neighboring gene. Second, even though regulatory elements and their target genes are generally in close contact in 3D space, as a result of chromatin looping, not all elements that are in contact have a functional impact on gene regulation8. Third, although enrichment of active histone marks, DNA hypomethylation and transcription factors are all hallmarks of regulatory elements, their presence does not imply functional impact. Finally, there is no ‘one rule for all’, and every gene can be controlled by a combination of regulatory elements (enhancers / silencers) and unique chromatin folding constraints (mediated by binding of CTCF and cohesin). Thus, in order to better understand mechanisms underlying transcriptional control, we need to take advantage of datasets that couple gene expression with epigenetic marks to construct functional models.
DNA methylation is an epigenetic mark associated with the regulation of gene expression15. It has long been appreciated that methylated CpG islands (CGI) in the promoter of a gene have a strong silencing effect, but beyond that, the functional impact of methylation on gene expression is context-dependent. Intragenic methylation occurs during transcript elongation in gene bodies to prevent the initiation of spurious transcripts16,17,18, while methylation of intronic and intergenic regions impacts the activity of regulatory elements19. Bisulfite converted sequence-based techniques such as whole genome bisulfite sequencing (WGBS), reduced-representation bisulfite sequencing (RRBS) and array-based techniques like Illumina’s BeadChip, have facilitated the genome-wide screening of chromatin for methylation marks, allowing for a systematic investigation of their methylation status. The ENCODE20 and TCGA projects are particularly useful resources for identifying novel putative regulatory elements as they provide paired DNA methylation and gene expression data across multiple cell lines and patient-derived samples.
Several approaches have been proposed to investigate the role of methylation in gene regulation in a systematic way. These methods can be divided into two broad categories based on the manner in which they interrogate possible connections: (i) association mining and (ii) regression modeling. In association mining, all possible pairwise connections between candidate regulatory sites and genes are tested independently, while in regression modeling all possible associations are considered simultaneously. The association mining category includes methods like ELMER21,22 and TENET23, in which correlations between differentially methylated sites and differentially expressed genes in tumor versus normal samples, are tested. The regression modeling category includes methods like ME-Class24 and the use of Random Forests (RFs) to predict whether a gene will be differentially expressed based on changes in methylation status and other chromatin features25. However, a limitation of this method is that the analysis is restricted to promoters and gene bodies. Other regression methods like REPTILE26, use RFs to directly identify putative enhancers instead of modeling enhancer-gene associations. Aside from their strengths and weaknesses, all the above methods focus on a narrow region around the promoter and the gene body.
In this paper we present MethNet, a pipeline to uncover regulatory networks linking CpG sites to gene expression. In contrast to other methods, we identify regulatory elements within a 1 Mb region around the promoter of every protein-coding gene in the genome, taking into account the overall decay of regulatory connections with distance from the gene. The resulting regulatory network recapitulated known regulatory mechanisms like the silencing effect of methylation at gene promoters and the enrichment of CpG islands in CREs. Importantly, MethNet also identified potential CREs whose methylation was correlated with transcriptional activation or repression of their associated target genes. We characterized these networks and used them to identify cis-regulatory hubs involving multiple associations with a robust regulatory potential that are predictive of patient survival across all cancers as well as in particular tumor types. Promoter-capture Hi-C confirmed that highly ranked associations are mediated through physical interactions between CREs and their target gene promoters. Further, CRISPR interference (CRISPRi) based single-cell RNA (scRNA) Perturb-seq confirmed that MethNet was able to accurately identify functional regulatory elements. Thus, MethNet is a powerful and cost-effective tool that can be used to understand the underlying mechanisms of distal gene regulation and to prioritize the verification of predicted non-coding disease hotspots.
Results
MethNet constructs regulatory networks using TCGA data
Most of the published epigenome-wide association studies (EWAS) have focused on links between differentially methylated regions and the expression of the closest gene. However, it is important to account for long-range interactions when modeling gene expression since regions that are distal on the chromatin fiber can be brought into close proximity by chromatin folding. Indeed, gene regulation occurs largely within highly self-interacting, megabase sized ‘topologically associated domains’ (TADs)27,28,29, that are separated by ‘insulating boundaries’ enriched for CTCF and cohesin. The boundaries are functionally important as they limit inter-TAD interactions, so that enhancers predominantly contact promoters within the same TAD. TADs form via ‘loop-extrusion’, with cohesin rings extruding DNA until they encounter two convergently oriented CTCF binding sites, which form the base of a loop30.
To account for this, MethNet considers as potential regulatory elements all the CpG probes that are within a 1 Mbp radius of the TSS. It uses gene expression and methylation data from TCGA samples to construct predictive models that quantify the contribution of each site to a gene’s regulation (Fig. 1). TCGA is the largest resource with both genome-wide gene expression and DNA methylation data from the same patient samples and is therefore well-suited for this type of analysis. Gene expression is modeled separately for each cancer-type to account for context-specificity. Elastic-net regularization is used to restrict spurious associations. A consensus network was constructed by aggregating associations across all cancers, so that robust associations found in multiple cancers were ranked higher relative to cancer specific associations and thus are less likely to correspond to false-positive results. Finally, we computed the regulatory potential of every CRE as a function of all the associations it is involved in, and characterized the epigenetic features that contribute (Fig. 1).
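To make the modeling step concrete, the following is a minimal sketch of elastic-net regression of one gene's expression on the methylation of candidate CpG sites, fitted by coordinate descent. It is an illustration under stated assumptions, not the MethNet implementation: the function names, the penalty parameterization (`alpha`, `l1_ratio`), the fixed number of sweeps and the synthetic data are all hypothetical choices made for this example.

```python
import random

def center(values):
    """Subtract the mean so that no intercept term is needed."""
    mu = sum(values) / len(values)
    return [v - mu for v in values]

def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the lasso proximal step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def elastic_net(columns, y, alpha=0.02, l1_ratio=0.5, n_sweeps=200):
    """Coordinate descent for the objective
    (1/2n)*||y - Xw||^2 + alpha*(l1_ratio*|w|_1 + 0.5*(1 - l1_ratio)*|w|_2^2).
    columns: one centered list per candidate CpG site; y: centered expression."""
    n = len(y)
    w = [0.0] * len(columns)
    residual = list(y)  # equals y - Xw while w is all zeros
    for _ in range(n_sweeps):
        for j, col in enumerate(columns):
            z = sum(v * v for v in col) / n
            if z == 0.0:
                continue  # a constant feature carries no information
            # Correlation of feature j with the residual that excludes feature j.
            rho = sum(c * (r + c * w[j]) for c, r in zip(col, residual)) / n
            new_w = soft_threshold(rho, alpha * l1_ratio) / (z + alpha * (1 - l1_ratio))
            delta = new_w - w[j]
            if delta != 0.0:
                residual = [r - c * delta for c, r in zip(col, residual)]
                w[j] = new_w
    return w

# Synthetic toy data (assumption): 5 candidate sites, only site 0 affects
# expression, and it does so negatively.
rng = random.Random(0)
n_samples, n_sites = 200, 5
meth = [[rng.random() for _ in range(n_samples)] for _ in range(n_sites)]
expr = [-2.0 * meth[0][i] + rng.gauss(0.0, 0.1) for i in range(n_samples)]
weights = elastic_net([center(col) for col in meth], center(expr))
print([round(w, 3) for w in weights])
```

In this toy setting the coefficient of the first site is strongly negative while the others are shrunk toward zero, which is the behavior the regularization is meant to provide when most candidate sites in the window are irrelevant.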
MethNet identifies putative activating and repressing distal associations
The challenge for any CRE-discovery method is that the number of potential associations grows exponentially with the radius of their regulatory window (Fig. 2a). A-priori there are approximately 8 million potential regulatory associations between CpG sites in a window size of 1 Mb on either side of every protein coding promoter. MethNet uses an average of 400,000 (5%) associations per individual cancer to model the expression of every gene (more statistics are shown in Figure S1). A pan-cancer analysis indicated that there is strong context-specificity in the regulatory network, with up to one third (2.6 million) of all potential CREs having a putative functional impact in at least one cancer. The strength of these regulatory associations varied across different cancer types, with the majority of associations being specific to a particular cancer, consistent with lineage specific transcription factor-mediated gene regulation (Fig. 2b). The context-specificity of the regulatory networks identified are associated with varying degrees of accuracy and reliability across different cancer types. Notably, larger sample sizes enhance MethNet’s capacity to uncover robust and reliable associations (Fig. 2c). This observation underscores the potential scalability of MethNet and its ability to leverage larger datasets to further elucidate the complex regulatory mechanisms governing gene expression in cancer. The results of the MethNet pipeline are shared as supplementary data on figshare (see Data Availability).
MethNet successfully recovers known regulatory mechanisms such as the well documented effect of gene silencing when a promoter is methylated (Fig. 2d). Specifically, methylation of the first CpG island upstream of the TSS has a strong negative correlation with expression. Furthermore, MethNet identifies CpG islands as being more likely to associate with and have a stronger effect on gene expression than inter-island regions in an unbiased way31 (Fig. 2d). On average, the closer a CRE is to its target gene, the greater the probability of having an impact on transcriptional output. Importantly, however, the probability of CRE-target gene association does not diminish to zero, indicating that CRE methylation beyond the immediate vicinity of the promoter can have an impact on gene expression in a context-specific manner.
We define MethNet associations that have a negative or positive coefficient as activating or repressing, respectively. In activating associations, a gain of methylation is predicted to decrease gene expression, while in repressive associations, a gain of methylation is predicted to increase gene expression. For example, MethNet identified a CTCF binding site located 250 kb upstream of the IFNγ promoter whose demethylation is linked to transcriptional repression (Fig. 2e and Supplementary Fig. S2). This suggests that CTCF binding to the unmethylated DNA sequence could be acting as an insulator preventing the IFNγ promoter from coming into contact with elements that activate its expression. Silencing of this inflammatory cytokine could have profound effects on immune responses to cancer, and its regulation is thus important in the context of immunotherapy. In contrast, MethNet identified the promoter of a non-coding gene located downstream of GSTT1 that when demethylated activates Glutathione S-transferase1 expression. This suggests that the promoter of the non-coding gene may be acting as an enhancer for GSTT1 (Fig. 2f and Supplementary Fig. S3). Glutathione S-transferases (GSTs) are phase II metabolizing enzymes that play a key role in protecting against cancer by detoxifying numerous potentially cytotoxic/genotoxic compounds. These two examples highlight the complex interplay between DNA methylation and gene expression, providing valuable insight into the mechanisms underlying the regulation of cancer relevant genes.
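The sign convention above can be stated compactly in code. This is a small illustrative sketch that assumes a linear additive model; the function names are hypothetical and not part of MethNet.

```python
def classify_association(coefficient):
    """Negative coefficient: methylation gain lowers expression (activating CRE).
    Positive coefficient: methylation gain raises expression (repressing CRE)."""
    if coefficient < 0:
        return "activating"
    if coefficient > 0:
        return "repressing"
    return "none"

def predicted_expression_change(coefficient, methylation_change):
    """Under a linear model, the predicted change is coefficient * change."""
    return coefficient * methylation_change

# Demethylation (negative change) at a repressing association is predicted
# to reduce expression, as described for the site upstream of IFNγ.
print(classify_association(0.2), predicted_expression_change(0.2, -0.5))
```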
The regulatory potential of CREs is correlated with chromatin context and contact frequency
The number of potential CREs controlling transcriptional output varies by gene, and the number of genes controlled by a single CRE varies by element. To study individual CREs and rank them by their impact on transcriptional output we defined a metric to capture the intrinsic regulatory potential. In particular, we aggregated a CRE’s contribution to the regulation of all genes in its vicinity. We quantified the importance of an association as the excess of its relative effect size with respect to a null model, where all elements have the same intrinsic potential and the only factor differentiating them is their distance to the gene promoter. The potential of a CRE was calculated by summing all the distance-adjusted contributions to genes in its vicinity (Fig. 3a).
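The aggregation described above can be sketched as follows. The exact form of the null model is not given in this section, so the exponential decay used here (`null_effect`, with a hypothetical `scale_bp` parameter) is only an assumed placeholder; what the sketch shows is the structure of the calculation, namely the excess over a distance-only expectation summed per CRE.

```python
import math

def null_effect(distance_bp, scale_bp=100_000):
    """Hypothetical null model: the expected relative effect depends only on
    distance to the promoter and decays exponentially (an assumption)."""
    return math.exp(-abs(distance_bp) / scale_bp)

def regulatory_potential(associations, null=null_effect):
    """associations: iterable of (cre_id, gene_id, relative_effect, distance_bp).
    Returns the summed distance-adjusted excess effect for each CRE."""
    potential = {}
    for cre_id, _gene_id, relative_effect, distance_bp in associations:
        excess = relative_effect - null(distance_bp)
        potential[cre_id] = potential.get(cre_id, 0.0) + excess
    return potential

# Toy associations (synthetic values) with a flat null of 0.1 for readability.
toy = [
    ("cre_A", "gene_1", 0.5, 20_000),
    ("cre_A", "gene_2", 0.4, 300_000),
    ("cre_B", "gene_1", 0.05, 50_000),
]
scores = regulatory_potential(toy, null=lambda distance_bp: 0.1)
print(scores)
```

A CRE that exceeds the distance-only expectation for several genes accumulates a large positive potential, whereas a CRE whose effects are weaker than expected for its distance ends up with a negative one.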
We next conducted a series of enrichment analyses to uncover distinctive characteristics that are linked to a CRE’s regulatory potential. First, we analyzed the chromatin state of each CRE using ChromHMM32 (Fig. 3b). These investigations indicate that promoters acting as enhancers controlling other distal genes have the highest overall regulatory potential. Our observations are in line with the known regulatory role of distal gene promoters9. Indeed, the associations detected by MethNet transcend linear distance, highlighting its capacity to identify distal regulatory networks.
Methylation has been shown to directly affect the presence of the transcriptional machinery, chromatin modifiers, and binding of transcription factors (TFs) whose chromatin occupancy, in turn, can act as a barrier to methylation3,33,34,35. To identify which factors could contribute to the regulatory potential of a CRE, we performed an enrichment analysis and identified members of the RNA polymerase complex (such as POLR2A and POLR2G) as the most enriched factors (Fig. 3c). Additionally, we identified other highly enriched factors known to be involved in chromatin remodeling and altering the methylation status, including EGR1, HDAC1, and PHF8. Depletion of EZH2 was also linked to increased regulatory potential, which makes sense as EZH2 is part of the PRC2 complex, which characterizes inactive chromatin. Moreover, using Hi-ChIP data from the benchmark study of Bhattacharyya et al.36, we observed a strong positive correlation between the regulatory potential of a region and the number of active chromatin loops (H3K27ac loops) anchored at it, further highlighting the dynamic interplay between chromatin structure and regulatory activity (Fig. 3d).
MethNet hubs that control multiple genes have an impact on patient survival
The distribution of the regulatory potential is long tailed, suggesting the existence of CREs with exponentially high regulatory potential (Fig. 4a) linked to multiple genes (Supplementary Fig. S4a, b). Intriguingly, these elements exhibited low methylation variance across the different TCGA datasets, indicating that they could be under robust and stringent regulation. We used the elbow method to show that above a select threshold, CREs were significantly enriched for regulatory associations. This approach identified 6137 CREs, that we refer to as ‘MethNet hubs’, since, as expected, their profile aligns with a model in which CREs exert control over multiple genes (Fig. 4b). A survival analysis, using the Cox proportional hazard model, across TCGA cancers with clinical data, revealed that methylation of MethNet hubs has a bigger impact on the overall survival of patients in comparison to non-hub elements of similar variance, both in a pan-cancer and cancer-specific context (Fig. 4c and Supplementary Fig. S4c). This finding provides evidence for MethNet hubs having a pivotal role in the context of cancer biology and underscores their potential clinical relevance.
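One common way to apply an elbow criterion to a long-tailed score distribution is sketched below. The specific variant used for MethNet is not described in this section, so this version (the point on the normalized rank-score curve farthest from the line joining its endpoints) is an assumption, as are the function name and the toy scores.

```python
def elbow_threshold(scores):
    """Return the score at the elbow of the descending rank-score curve.
    Both axes are rescaled to [0, 1]; the elbow is the point farthest from
    the straight line joining the first and last points of the curve."""
    ranked = sorted(scores, reverse=True)
    n = len(ranked)
    top, bottom = ranked[0], ranked[-1]
    if n < 3 or top == bottom:
        raise ValueError("need at least three scores that are not all equal")

    def distance_to_diagonal(i):
        x = i / (n - 1)
        y = (ranked[i] - bottom) / (top - bottom)
        # The endpoints are (0, 1) and (1, 0); the distance to the line
        # x + y = 1 is proportional to |x + y - 1|.
        return abs(x + y - 1.0)

    elbow_index = max(range(n), key=distance_to_diagonal)
    return ranked[elbow_index]

# Toy long-tailed scores (synthetic values).
toy_scores = [100, 50, 10, 5, 4, 3, 2, 1, 1, 1]
threshold = elbow_threshold(toy_scores)
hubs = [s for s in toy_scores if s > threshold]
print(threshold, hubs)
```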
To gain insight into the characteristics of MethNet hubs, we repeated the previous enrichment analyses, this time comparing hubs to other CREs with positive potential (Fig. 4d–f). As anticipated, we observed that hubs share many of the characteristics exhibited by high-ranking non-hub elements; however, key differences emerged upon closer examination. First, hubs are more likely located in open chromatin regions than non-hub CREs (Supplementary Fig. S4d, e). Next, in the ChromHMM and binding site analysis, insulating elements and CTCF were respectively enriched in MethNet hubs compared to non-hubs (Fig. 4d, e). This finding is consistent with the known role that insulating elements play in gene regulation via chromatin looping and insulated topologically associated domain (TAD) boundaries37,38. This hypothesis is further supported by our data showing that hubs are depleted in elements lacking chromatin loops (Fig. 4f). In summary, we identified MethNet hubs that are characterized by high-ranking regulatory potential and whose methylation status has low variance. Hubs are enriched for regulatory associations and insulating elements and show significant clinical relevance with respect to cancer patient survival.
MethNet hubs uncover known and potentially novel regulatory elements
Examples of two regulatory hubs are shown in Fig. 5. The regulation of the Protocadherin alpha (PCDHA) cluster of genes has been shown to be controlled by a regulatory hub (HS5−1) that overlaps a CTCF binding site, which stochastically activates different PCDHA genes by cohesin-mediated looping39,40. Using MethNet we were able to unbiasedly identify this regulatory hub (highlighted in blue). We also found another hitherto unknown regulatory hub that is enriched for H3K27ac and H3K4Me3 (highlighted in orange) upstream of PCDHA. This hub has predicted regulatory associations with genes from all three clusters of the Protocadherin family, PCDHA, PCDHB and PCDHG, and the region has been characterized as a schizophrenia risk locus that is linked with the regulation of all three protocadherin families in the context of brain41. In-situ Hi-C data reveals the high contact frequency of this CRE with all three Protocadherin families (as shown in red in the Hi-C heatmap).
High scoring MethNet associations are mediated by long-range chromatin interactions
To determine whether distal CRE associations identified by MethNet are brought into contact with their target genes by chromatin looping, we performed a promoter-capture Hi-C experiment that identified chromatin interactions from all promoters in the genome in two distinct well characterized A549 and K562 cell lines. The quality control for the promoter-capture Hi-C is shown in Supplementary Fig. S5. Promoter-capture Hi-C, which enriches for loops anchored at gene promoters, allows us to simulate parallel 4C-seq experiments and recover regions that are in physical contact with each promoter. Given that MethNet consists of common and cell type specific associations, we would not expect all associations from our pan cancer analysis to be validated with the promoter-capture Hi-C data from the A549 and K562 cell lines. Indeed, an example of multi-locus interactions for the TP53 gene promoter in A549 and K562 shown in Fig. 6a, highlights the cell type specific chromatin contacts found in the two cell lines.
To determine whether associations with higher scores are more likely to be facilitated via chromatin interactions, we used the union of loops from the A549 and K562 cell lines. We found a robust correlation between association score and probability of loop formation across all levels (Fig. 6b). This data demonstrates that stronger MethNet associations are more likely to act via chromatin contacts, while weaker connections may represent indirect effects.
Next, we investigated whether the regulatory potential of MethNet is predictive of chromatin hubs, i.e. CREs that form multi-locus loops42,43. Remarkably, MethNet’s regulatory potential demonstrated a high predictive power for identifying multi-locus loops, achieving an area under the receiver operating characteristic curve (AUC) of 86%, when applying the strictest criteria (Fig. 6c). Although our analysis was primarily focused on promoter hubs (due to the experimental bias of promoter-capture Hi-C), the predictive power of MethNet extended to non-promoter regions when less stringent criteria were used for calling hubs (maximum AUC 84%, Supplementary Fig. S6). In sum, our data indicate that high-ranking distal MethNet associations are brought into close physical proximity by long-range chromatin interactions.
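For readers unfamiliar with the metric, the AUC can be computed directly as the probability that a randomly chosen positive (here, a region called as a multi-locus hub) receives a higher score than a randomly chosen negative. The sketch below uses synthetic scores and labels and a hypothetical function name; it illustrates the metric only and does not reproduce the analysis in Fig. 6c.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (Mann-Whitney form).
    labels are truthy for positives; ties between scores count as one half."""
    positives = [s for s, is_pos in zip(scores, labels) if is_pos]
    negatives = [s for s, is_pos in zip(scores, labels) if not is_pos]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Synthetic example: regulatory potential as the score, hub status as the label.
potential = [0.9, 0.8, 0.3, 0.2]
is_hub = [1, 0, 1, 0]
print(roc_auc(potential, is_hub))
```

An AUC of 0.5 corresponds to a score that ranks hubs no better than chance, and 1.0 to a score that ranks every hub above every non-hub.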
Perturbation of MethNet hubs results in altered target gene expression
To functionally validate the predictions generated by MethNet, we performed perturb-seq that combines targeted perturbation of genomic regions with single-cell RNA sequencing (scRNA-seq). Compared to a regular CRISPR interference (CRISPRi) assay, perturb-seq enables the simultaneous investigation of multiple genomic regions, using a pool of guide RNAs (sgRNAs). For the CRISPRi, we used the dCas9-KRAB-MeCP2 system that blocks the binding of other factors and methylates the DNA as well as histones of the targeted region44, inducing robust silencing. The transcriptomic readout aligns well with the MethNet pipeline, which predicts changes in the expression of target genes. Furthermore, perturbation of distal regulatory elements is more likely to lead to subtle gene expression changes rather than cell death45, particularly when there is more than one CRE controlling a gene target. Supplementary Fig. S7 shows the percentage of A549 cells transfected with dCas9-KRAB-MeCP2 at day 14 after puromycin selection.
In total, we targeted 55 potential regulatory elements with 2 to 5 guides each. To address the inherent limitations of the assay, targets were selected based on the following criteria: (i) CREs were unmethylated in A549 cells to allow for methylation by the dCas9-KRAB-MeCP2, (ii) 2 to 5 high-quality guides could be selected using the CRISPick46,47 scoring system, and (iii) putative target-genes were expressed at levels detectable by scRNA-seq. An outline of the perturb-seq validation experiment is shown in Fig. 7a.
Although we designed our experiment so that each cell received a single guide, some cells contained multiple guides (Supplementary Fig. S8). Therefore, we used a linear model with a complex design matrix to deconvolve the individual effects of each guide on gene expression, similar to the approach used by Dixit et al.48. The results of this analysis are illustrated in Fig. 7b. Among the 55 targeted CREs, 17 were validated as being involved in 22 functional associations predicted by MethNet. To assess the significance of the identification of 17/55 functional CREs, we performed a bootstrap analysis, by randomly shuffling the sgRNA labels across cells, while maintaining the total number of detected guides. We generated 20,000 bootstrap samples to estimate the null distribution (Fig. 7c) and estimated that detecting 17 or more regulatory regions was highly unlikely to have arisen by chance (p = 0.0004). This statistical evaluation supports the robustness and significance of the regulatory regions identified through perturb-seq.
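The label-shuffling procedure can be illustrated with a simplified sketch. The statistic in the study is the number of validated CREs out of 55; here a much simpler statistic (the absolute difference in mean expression between cells with and without one guide) stands in for it, and all names and data are hypothetical. Shuffling the labels keeps the number of guide-bearing cells fixed, and the empirical p-value is the fraction of shuffles whose statistic is at least as large as the observed one.

```python
import random

def mean(values):
    return sum(values) / len(values)

def guide_effect(expression, has_guide):
    """Absolute difference in mean expression between cells with and without
    the guide (a simplified stand-in for the study's statistic)."""
    with_guide = [e for e, g in zip(expression, has_guide) if g]
    without_guide = [e for e, g in zip(expression, has_guide) if not g]
    return abs(mean(with_guide) - mean(without_guide))

def shuffle_p_value(expression, has_guide, n_shuffles=2000, seed=0):
    """Empirical p-value from shuffling guide labels across cells.
    Shuffling preserves how many cells carry the guide."""
    rng = random.Random(seed)
    observed = guide_effect(expression, has_guide)
    labels = list(has_guide)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(labels)
        if guide_effect(expression, labels) >= observed:
            hits += 1
    return hits / n_shuffles

# Synthetic cells: the guide silences the target gene completely.
toy_expression = [0.0] * 10 + [10.0] * 10
toy_has_guide = [True] * 10 + [False] * 10
p_value = shuffle_p_value(toy_expression, toy_has_guide)
print(p_value)
```

With a finite number of shuffles the smallest non-zero p-value that can be reported is 1 / `n_shuffles`, which is one reason a large number of resamples is used.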
Out of the 17 functional CREs identified, 4 were found to be associated with more than one target gene. In Fig. 7d we highlight a regulatory hub that corresponds to the promoter of BNIP2 (highlighted in orange). BNIP2 itself is not highly expressed so it was not captured by the perturb-seq, but expression of two predicted target genes, GCNT3 and ANXA2 (red loops), was found to be significantly downregulated. Both of these genes are associated with poor prognosis in multiple cancers. ANXA2 is a member of the calcium-mediated phospholipid-binding protein family of annexins, involved in epithelial mesenchymal transition, cell proliferation and survival49, while GCNT3 is a member of the N-acetylglucosaminyltransferase family that is associated with cell proliferation, migration and invasion in non-small-cell lung cancer50. Our data indicate that the hub corresponds to a promoter region that is predicted to act as an enhancer for GCNT3 and ANXA2, and thus methylation leads to a drop in their expression as we observed. We also identified two genes, MYO1E and LDHAL6B, that are predicted by MethNet to be targets of the hub (highlighted by gray loops in the screenshot) that we were unable to validate. Although both gene targets are expressed and unmethylated in A549 cells, unlike ANXA2 and GCNT3 they were not connected to BNIP2 by chromatin loops in our promoter-capture Hi-C. Chromatin looping is likely to be important for long-range hub mediated regulation, and we speculate that in the case of these two target genes, contacts are regulated in a cell type specific manner. Other MethNet associations validated by the perturb-seq experiment are shown in Supplementary Fig. S9. These include the target genes AMIGO2, a cell adhesion protein that is linked with cell survival and metastasis of multiple adenocarcinomas51,52, and IGFBP4, a tumor suppressor acting as double-negative feedback in AKT and EZH2 signaling53 (Supplementary Fig. S10).
Taken together, these data confirm the robustness of the MethNet pipeline in predicting regulatory associations.
Discussion
Here we introduce MethNet, a pipeline that combines gene expression and methylation data from the same TCGA cancer samples to identify regulatory elements that can control genes beyond their immediate genomic vicinity. MethNet’s unbiased approach led to the identification of numerous regulatory features commonly associated with the role of methylation in gene expression. In addition, it revealed the existence of previously unknown regulatory elements with potential clinical significance. MethNet’s most intriguing attribute is the ability to uncover the presence of regulatory hubs that can influence the expression of multiple genes. These hubs displayed the expected characteristics of previously identified hubs10, such as enrichment in active chromatin marks and chromatin looping connections between hub CREs and their target genes. MethNet also revealed distinctive features of hub CREs, including a relatively high proportion of insulator elements compared to regular CREs and low methylation variance, suggesting that hub regulation is influenced by TAD structure and that hubs are tightly regulated. Moreover, the methylation status of hubs showed a significant correlation with overall patient survival as well as cancer-specific survival.
It should be noted that our modeling of epigenetic context was limited by the availability of data. The DNA methylation profiles provided by TCGA were generated using the 450k array, whereas larger 850k arrays are now widely used. Furthermore, TCGA lacks other relevant data modalities, such as chromatin accessibility and protein binding profiles, that are important for a more complete understanding of the regulatory landscape. As the volume and diversity of genomic data expand, MethNet can be adapted to incorporate additional data modalities, further enhancing its capacity to unravel the complexities of gene regulation.
Any method that identifies regulatory elements by linking gene expression with methylation must deal with the problem of spurious correlations. This issue is exacerbated by the fact that we consider long-range (up to 1 Mbp) associations and that nearby methylation sites tend to be correlated and can act synergistically. To address the confounding effect of correlated methylation sites, we clustered probes within 200 bp into a single variable. Clustering neighboring CpG sites is standard procedure for smoothing technical noise and reducing biological artifacts, such as genetic variants that destroy CpG sites. Clustering increases the power of the analysis without losing information, since methylation of proximal CpGs is highly correlated. In addition, it is well documented that methylation changes are found in differentially methylated regions (DMR) typically spanning ~100–1000 bp regions. The 200 bp window is a standard size for CpG clustering, as it encompasses both typical transcription factor binding sites and the length occupied by histones. We evaluated the performance of the clustering using three metrics: mean cluster size, number of clusters and coefficient of variation (Supplementary Fig. S10).
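A minimal sketch of the probe clustering step follows. The exact grouping rule is not specified beyond the 200 bp window, so this version assumes that consecutive probes on the same chromosome are chained into one cluster whenever the gap between neighbors is at most 200 bp, and that a cluster's methylation is the mean beta value of its probes; the function names and toy coordinates are hypothetical.

```python
def cluster_probes(positions, max_gap=200):
    """Group probe coordinates (same chromosome) so that neighboring probes
    within max_gap bp of each other share a cluster (assumed chaining rule)."""
    clusters = []
    for pos in sorted(positions):
        if clusters and pos - clusters[-1][-1] <= max_gap:
            clusters[-1].append(pos)
        else:
            clusters.append([pos])
    return clusters

def cluster_methylation(clusters, beta_by_position):
    """Summarize each cluster by the mean beta value of its probes."""
    return [sum(beta_by_position[p] for p in c) / len(c) for c in clusters]

# Synthetic probe coordinates and beta values.
toy_positions = [100, 250, 1000, 1150, 1400]
toy_betas = {100: 0.2, 250: 0.4, 1000: 0.9, 1150: 0.7, 1400: 0.5}
toy_clusters = cluster_probes(toy_positions)
print(toy_clusters, cluster_methylation(toy_clusters, toy_betas))
```

Under a chaining rule a cluster can span more than 200 bp when several probes lie close together; a fixed-window rule would behave differently in dense regions, which is why the rule is flagged as an assumption here.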
Although interactions over a 1 Mbp range are important for gene regulation, testing all possible promoter-CRE pairs is prone to high false discovery rates. We addressed this issue by combining data across genes and cancers in a statistically principled manner. In particular, we used elastic-net regression, tuned with cross-validation, to promote sparsity within every cancer and then pooled the resulting associations across cancers based on their predictive strength, while accounting for known confounding factors (see Methods: MethNet score). This multi-level approach significantly reduced the number of identified CREs per gene compared to a naïve analysis of variance with lasso-penalty (Supplementary Fig. S11).
Previously developed methods try to mitigate the problem of spurious correlations by limiting the range of associations and by using permutation or cross-validation techniques. Our modeling approach is similar to that of Methylation-eQTL in that it also uses TCGA data and penalized regression to identify regulatory associations. Both methods assume a linear additive model, which is a pragmatic choice given the size of the currently available data set. However, while Methylation-eQTL limits its scope to a 500 kbp window around the gene and uses a sequential lasso approach to deal with promoter sparsity in individual tumors, we analyzed a 1 Mbp window around each gene and employed elastic-net regularization, which leads to a less sparse solution and combats spurious CRE associations by aggregating results across multiple cancers. Another important difference between the two approaches is that MethNet, in addition to focusing on individual CRE-promoter connections, uncovers regulatory hubs and highlights their relevance in the context of normal gene regulation and cancer. In contrast to Methylation-eQTL, which only performed cross-validation using an independent data set, our experiments provide causal and mechanistic support for our findings, by validating the functional and physical connections between hub CREs and their target genes.
Our motivation in this study was to uncover interactions that are robust across multiple cancers rather than cancer-defining CRE-gene interactions. There have been several studies that identify tissue-specific methylation54,55 and gene markers56. MethNet does not aim to recover all cancer-specific regulatory associations but instead identifies a core, robust set of dynamic associations that are recurrent across tissues and revealed by cancer deregulation. These core CREs are likely to be important for carcinogenic processes in general, regardless of the tissue type, and they could be useful for understanding mechanisms underlying cancer and for identifying targetable hotspots. We therefore treated each TCGA cancer study independently and normalized the coefficients to produce dimensionless units that are comparable across cancers. Cancer-defining CRE-gene interactions were detected as low-confidence associations, which were effectively ignored because our focus was to identify robust cross-cancer CREs.
MethNet identified 37,740 distal regulatory elements, including the regulatory hub in the PCDHA cluster. This hub mapped to a GWAS (genome-wide association study) schizophrenia risk locus, supporting the functional relevance of the pipeline. While GWAS have identified more than 100,000 disease-associated SNPs (single nucleotide polymorphisms) over the past decades57, the identification and prioritization of causal germline SNPs remains challenging due to linkage disequilibrium and the size of the non-coding genome. Nonetheless, identifying causal SNPs is of paramount importance in understanding the underlying mechanisms of disease susceptibility and leveraging GWAS data. Towards this end, MethNet provides a valuable resource for prioritizing the verification of predicted non-coding disease variants that are likely to disrupt regulatory elements versus the numerous ‘proxy’ variants in high linkage disequilibrium that share the same disease-association statistics, but have no functional effect.
The results from the Pan-Cancer Analysis of Whole Genomes (PCAWG) and TCGA consortiums, which include more than 2500 cancer samples, were somewhat disappointing as only a few non-coding somatic driver mutations were identified58. This can be partly explained by the fact that the statistical approaches to identify coding driver mutations are not tailored to the non-coding genome. Furthermore, it has been shown that methods based on functional screening followed by experimental validation are better able to identify bona fide non-coding driver mutations59, compared to approaches that focus on the accumulation/recurrence of non-coding mutations. In this context, MethNet, in addition to being able to prioritize candidate non-coding driver mutations, can also identify the target genes of disrupted regulatory elements, which is crucial for developing novel therapeutic drugs.
Many studies have pointed out that regulation is more complex than simple enhancer-promoter loops9. For example, super-enhancers (SEs), which are large regions (around 8 kb) characterized by strong enrichment in chromatin marks like Med1 or H3K27ac60, are associated with highly expressed, tissue-specific genes61. In addition, there are CREs with multi-locus contacts that influence the expression of numerous genes. Our analysis of the consensus regulatory network revealed the existence of both individual CRE-promoter contacts, as well as MethNet hubs that have an outsized influence on gene expression.
In this study, we took advantage of a high-throughput perturb-seq assay that combines CRISPRi screening with scRNA. This assay is well suited to the functional screening of distal regulatory elements since it allows the identification of relatively modest gene expression changes upon perturbation of these elements. While coding mutations or disruption of gene promoters can lead to dramatic changes in cell fitness, it has been shown that the regulatory potential of distal elements might be more subtle, although these are likely to be important in disease processes.
Using perturb-seq, we were able to demonstrate that about one-third of the targeted regulatory hubs were associated with transcriptional changes in cancer-related genes, suggesting that a substantial proportion of MethNet hubs contribute to the cancer cell phenotype by activating or silencing oncogenic and tumor-suppressor genes, respectively. In addition, the validated MethNet associations were not trivial, but linked genes to regulatory elements over long distances (mean distance 366 kbp), bypassing multiple potential regulatory elements in between. For some of the targeted MethNet hubs, we observed expression changes in at least one of the predicted target genes, while for others no alterations were detected. This could be because (1) gene expression levels were below the threshold detectable by scRNA-seq, or (2) MethNet hubs are defined using a pan-cancer approach, and while this allows us to prioritize the most robust candidates, it does not rule out context-specific gene regulation, as we highlight in Fig. 7, where predicted hub-target gene associations did not overlap with chromatin loops in the cell line analyzed. Therefore, although MethNet is useful in identifying high-confidence regulatory hubs, studying their gene regulation network in a tissue-specific manner is also important.
In conclusion, MethNet represents a powerful computational framework for the integrative analysis of DNA methylation and gene expression data. Our study demonstrates the effectiveness of MethNet in identifying regulatory associations across multiple cancer types as well as context-specific connections, highlighting the importance of functional integrative analysis over simple correlations. The performance of MethNet scales with data, indicating its potential for further expansion with larger datasets and inclusion of other data modalities. Identification of previously unreported hubs and their association with clinical outcomes sheds light on the intricate interplay between chromatin structure and global transcriptional regulation. Overall, MethNet represents a valuable resource for deciphering the complex regulatory mechanisms underlying gene expression in cancer, and for prioritizing the validation of germline and somatic non-coding disease-associated variants.
Methods
Promoter Capture HiC sample and library preparation
Promoter Capture Hi-C data were generated in K562 and A549 cell lines using the Arima Capture-HiC+ Kit (catalog number: A301010), including the Arima Promoter Capture Module and the Arima Library Prep Module, according to the Arima Genomics manufacturer’s protocols. The A549 and K562 cell lines were purchased from ATCC (catalog number: CCL-185 and CRL-3343, respectively). Two replicates of the Hi-C were performed in each cell line, and for each replicate 1 million cells were collected and double cross-linked using 3 mM DSG (disuccinimidyl glutarate), followed by 1% formaldehyde. Samples were sequenced with Illumina NovaSeq technology according to standard protocols, with around 300 million (150 bp paired-end) reads per sample. The library preparation and sequencing were conducted by NYU Langone’s Genome Technology Center.
Perturb-seq sample and library preparation
CloneTracker XP CRISPR Barcode pooled lentiviral libraries expressing barcoded sgRNAs, the puromycin selection gene as well as an RFP reporter were purchased from Cellecta® (Catalog number: custom library, CPLVSGL-P; lentiviral packaging service, CLVP-V). The plasmid (pGC02-EFS-KRAB-dCas9-MeCP2-2A-Blast) was a gift from Dr. Neville Sanjana. The A549 cells were transduced with lentiviruses expressing the dCas9-KRAB-MeCP2 plasmid as described in Yeo et al.42. Cells were then grown in DMEM medium (Gibco/Invitrogen) supplemented with 10% FBS, 100 units/ml penicillin and 100 μg/ml streptomycin, at 37 °C with 5% CO2.
In total, we targeted 55 potential regulatory elements with 2 to 5 guides each. To address the inherent limitations of the assay, targets were selected based on the following criteria: (i) CREs were unmethylated in A549 cells to allow for methylation by the dCas9-KRAB-MeCP2, (ii) 2 to 5 high-quality guides could be selected using the CRISPick44,45 scoring system, and (iii) putative target genes were expressed at levels that were detectable by scRNA-seq. An association was considered detectable if the gene was highly expressed and MethNet predicted that methylation would lead to silencing, or if the gene was lowly expressed and MethNet predicted that methylation would lead to expression. We also included 3 partially methylated CREs and 2 with no detectable targets in A549 as controls (Supplementary Data 1: Perturb-seq design – Sheets for CRE and Gene criteria).
The 248 sgRNAs targeting high-confidence MethNet hubs and 5 non-targeting sgRNAs (negative controls) were designed using CRISPick44,45 for the Human GRCh37 (hg19) assembly using parameter settings: CRISPRi, SpyoCas9/Chen (2013) tracrRNA (Supplementary Data 1: Perturb-seq design – Sheet for sgRNA Sequences). The lentiviral libraries were transduced into the A549-dCas9-KRAB-MeCP2 cells according to Cellecta® protocols. Briefly, 10^5 cells/well were seeded into a 6-well plate. The optimal MOI of viral particles (to reach ~30–40% of infected cells) was added. On day 3, 2 μg/ml of puromycin was added, resulting in >90% transduced cell selection as confirmed by cytofluorometry using RFP (Supplementary Fig. S7). Cells were expanded under puromycin selection until day 14. For scRNA-seq using the 10X Genomics technology, 25,000 cells were harvested. For optimal multiplet detection and optimal signal-to-noise ratios, cells were hash-tagged using 5 cell multiplexing oligos purchased from 10X Genomics (catalog number: 1000261, 1000262, 1000243, 1000242). The sequencing library was then prepared using the Chromium Next GEM Single Cell 3ʹ Kit (catalog number: 1000268, 1000120 and 1000215) according to the manufacturer’s protocol and sequenced by NYU Langone’s Genome Technology Center.
Statistics and reproducibility
No samples were excluded from the analyses unless they contained missing data. We limited our analysis to protein-coding genes and cancer studies with at least 100 samples with matching RNA-seq and DNA methylation data. In total, we used 8,264 samples. We used GENCODE v30 gene annotation62 and the Illumina CpG probe coordinates for hg19.
Gene expression modeling
We constructed a gene-probe adjacency network by connecting a gene with all probes within 1Mbp on either side of its transcription start site (TSS). This resulted in a network of 13 M interactions between 20k genes and 450k probes. We collapsed probes that were within a diameter of 200 bp (complete linkage) into probe clusters by averaging their beta value and linking them with all the genes interacting with the original probe set. This resulted in 300k CpG clusters, of a single probe or more, and a new network of 8 M interactions.
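The probe-collapsing and linking steps above can be illustrated with a minimal sketch. It assumes all positions lie on a single chromosome and uses a greedy left-to-right sweep, which guarantees that no cluster spans more than 200 bp but is not necessarily identical to the complete-linkage clustering used in the study. The function names and the toy coordinates are hypothetical.

```python
from typing import Dict, List

WINDOW_BP = 1_000_000   # 1 Mbp on either side of the TSS (from the text)
MAX_DIAMETER_BP = 200   # maximum span of a probe cluster (from the text)


def cluster_probes(positions: List[int],
                   max_diameter: int = MAX_DIAMETER_BP) -> List[List[int]]:
    """Greedy sweep: open a new cluster whenever adding the next probe
    would make the cluster span exceed max_diameter (an approximation)."""
    clusters: List[List[int]] = []
    for pos in sorted(positions):
        if clusters and pos - clusters[-1][0] <= max_diameter:
            clusters[-1].append(pos)
        else:
            clusters.append([pos])
    return clusters


def cluster_beta(cluster: List[int], beta: Dict[int, float]) -> float:
    """Average the beta values of the probes in one cluster."""
    return sum(beta[p] for p in cluster) / len(cluster)


def link_gene(tss: int, clusters: List[List[int]],
              window: int = WINDOW_BP) -> List[int]:
    """Indices of clusters with at least one probe within `window` bp of the
    TSS, so a cluster inherits the links of its original probes."""
    return [i for i, c in enumerate(clusters)
            if any(abs(p - tss) <= window for p in c)]


# Toy example with made-up coordinates and beta values.
probes = [100, 150, 290, 320, 5000]
clusters = cluster_probes(probes)
mean_beta = cluster_beta(clusters[0], {100: 0.2, 150: 0.4, 290: 0.6})
linked = link_gene(1_000_200, clusters)
```

In this toy case the probe at 320 starts a new cluster because it lies 220 bp from the first probe of the preceding cluster.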
We fitted a linear model for each gene and TCGA cancer independently, removing gene-cancer pairs with low variance (standard deviation less than 1). The variables of the model included all the methylation clusters neighboring the gene in the adjacency network as well as the sample type (tumor or metastatic vs normal) wherever there were multiple sources. To promote sparsity and better generalization in our models, we used elastic net regularization to limit the number and effect size of the cluster-gene interactions, using the model specification:
$$y_{gi} = \beta_{g0}(z_i) + \sum_{c=1}^{k_g} \beta_{gc}\, x_{ic} + \epsilon_{gi}$$
Here i, g and c are indexes for the sample, gene, and cluster, respectively. $y_{gi}$ is the log-normalized expression of gene g in sample i, $\beta_{g0}(z_i)$ is the basal expression level for sample i given its clinical profile $z_i$ (tumor or normal), $\beta_{gc}$ and $x_{ic}$ are the coefficient and methylation status (beta value) of cluster c, respectively, and $k_g$ is the number of clusters neighboring g. $\epsilon_{gi}$ is the error term of the model. The R package glmnet63 was used to fit the models and determine the trade-off (λ, α) between accuracy and sparsity via 10-fold cross-validation. This process resulted in a series of regulatory networks, one per cancer, connecting a gene with a methylation cluster if the cluster had a non-zero coefficient in the corresponding gene model.
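To make the role of the penalty concrete, the following is a minimal sketch of elastic-net regression by cyclic coordinate descent for a single gene. It is not the glmnet implementation used in the study: it assumes fixed values of λ and α instead of selecting them by cross-validation, it only centers (does not standardize) the predictors, and it omits the sample-type covariate. The toy data are invented so that one cluster drives expression and the other is uninformative.

```python
from typing import List, Tuple


def soft_threshold(value: float, threshold: float) -> float:
    """Soft-thresholding operator arising from the lasso part of the penalty."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def elastic_net(X: List[List[float]], y: List[float], lam: float,
                alpha: float, n_iter: int = 200) -> Tuple[float, List[float]]:
    """Minimise (1/2n)*RSS + lam*(alpha*|b|_1 + (1-alpha)/2*|b|_2^2)
    by cyclic coordinate descent; the intercept is not penalised."""
    n, k = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(k)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(k)] for row in X]
    resid = [v - y_mean for v in y]          # residuals when all betas are 0
    beta = [0.0] * k
    for _ in range(n_iter):
        for j in range(k):
            col = [Xc[i][j] for i in range(n)]
            norm = sum(v * v for v in col) / n
            if norm == 0.0:
                continue                      # constant column: nothing to fit
            # Covariance of column j with the residual that excludes column j.
            rho = sum(col[i] * resid[i] for i in range(n)) / n + norm * beta[j]
            new = soft_threshold(rho, lam * alpha) / (norm + lam * (1 - alpha))
            delta = new - beta[j]
            for i in range(n):                # keep residuals in sync
                resid[i] -= delta * col[i]
            beta[j] = new
    intercept = y_mean - sum(b * m for b, m in zip(beta, x_mean))
    return intercept, beta


# Toy data: expression falls with methylation of cluster 0 only.
x0 = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
x1 = [0.9, 0.1, 0.5, 0.5, 0.1, 0.9]          # uncorrelated with x0
X = [[a, b] for a, b in zip(x0, x1)]
y = [5.0 - 2.0 * a for a in x0]
intercept, beta = elastic_net(X, y, lam=0.01, alpha=0.5)
```

The coefficient of the uninformative cluster is set exactly to zero, while the informative one is retained with a slightly shrunken negative coefficient; in the network, only the latter would form an edge.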
For our pan-cancer analysis, we combined the interaction coefficients across all cancers by averaging across all cancers weighted by the corresponding model’s performance as measured by R2.
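A minimal sketch of this weighted average is given below. It assumes that an interaction absent from a fitted model contributes a coefficient of zero and that gene-cancer pairs without a model receive zero weight; both choices, the data layout, and the example values are illustrative assumptions.

```python
from typing import Dict, Tuple

Edge = Tuple[str, str]            # (gene, cluster)
ModelKey = Tuple[str, str]        # (cancer, gene)


def consensus_coefficients(coefs: Dict[str, Dict[Edge, float]],
                           r2: Dict[ModelKey, float]) -> Dict[Edge, float]:
    """R^2-weighted mean of each interaction coefficient across cancers.

    coefs[cancer][(gene, cluster)] is the fitted coefficient (missing = 0.0);
    r2[(cancer, gene)] is the R^2 of that gene model in that cancer.
    """
    edges = {edge for model in coefs.values() for edge in model}
    consensus: Dict[Edge, float] = {}
    for gene, cluster in edges:
        num = 0.0
        den = 0.0
        for cancer, model in coefs.items():
            weight = max(r2.get((cancer, gene), 0.0), 0.0)
            num += weight * model.get((gene, cluster), 0.0)
            den += weight
        if den > 0.0:
            consensus[(gene, cluster)] = num / den
    return consensus


# Made-up coefficients for one gene in two cancer studies.
coefs = {
    "BRCA": {("G1", "c1"): -1.0},
    "LUAD": {("G1", "c1"): -0.5, ("G1", "c2"): 0.2},
}
r2 = {("BRCA", "G1"): 0.6, ("LUAD", "G1"): 0.2}
pan_cancer = consensus_coefficients(coefs, r2)
```

Under these assumptions, an interaction supported by only one poorly fitting model (c2 here) receives a small consensus coefficient, which is how aggregation dampens spurious associations.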
MethNet score
To quantify the contribution of each gene-cluster interaction to overall gene expression, we calculated a contribution score based on the following function:
In contrast to the coefficient ($\beta_{gc}$), which quantifies how much the expression of gene g is altered if cluster c switches from a completely unmethylated to a completely methylated state, the score ($b_{gc}$) is intended to capture the information gained by using MethNet’s regulatory network instead of a naive model where all neighboring elements contribute equally ($1/k_g$). Scores were computed based on the consensus network (pan-cancer analysis).
Finally, to identify the intrinsic potential of a regulatory element, we regressed out the effect of the distance ($d_{gc}$), which acts in an element-agnostic manner, from the MethNet score using a generalized additive model (GAM) with the absolute distance between TSS and methylation cluster as the only predictor, and took the residuals. Thus, for the purposes of characterizing regulatory hubs, we assume a distance-based, instead of an equal, contribution of each region. To fit the GAM $f(d_{gc})$, we excluded associations that overlap the gene body or the promoter, since they are mediated via a different mechanism of action and would confound the effect.
The MethNet potential of a cluster is defined as the sum of the scores of all the interactions it is part of.
In total, we computed the MethNet potential for 245,555 CRE candidates. MethNet association and CRE scores are provided as supplementary data methnet.csv and cluster_score.csv at figshare (see Data Availability).
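The aggregation from interaction scores to a per-cluster potential can be sketched as follows. The input is assumed to be a list of (gene, cluster, score) records holding the distance-adjusted scores; the record layout and the example values are hypothetical and do not reflect the format of the supplementary files.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def methnet_potential(
        scores: Iterable[Tuple[str, str, float]]) -> Dict[str, float]:
    """Sum the scores of all interactions that each cluster takes part in."""
    potential: Dict[str, float] = defaultdict(float)
    for _gene, cluster, score in scores:
        potential[cluster] += score
    return dict(potential)


# Made-up scores: cluster c1 is linked to three genes, c2 to one.
records = [
    ("G1", "c1", 0.5),
    ("G2", "c1", 0.25),
    ("G3", "c1", 0.25),
    ("G1", "c2", -0.5),
]
potential = methnet_potential(records)
```

A cluster that contributes to many genes accumulates a high potential, which is the property used to nominate hubs.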
Regulatory effect as a function of distance
We analyzed the relationship between MethNet associations and distance to gene separately for CpG sites located within and outside the gene body.
For interactions occurring outside the gene body, we aggregated CpG clusters into two categories: CpG island and non-island regions. Subsequently, for each gene, we separately ranked these categories based on their distance to the gene body, where −1 is the closest upstream region, −2 the second closest, etc., and accordingly 1, 2 for downstream regions. Two metrics were used: the mean coefficient of interaction and the probability of interaction, which were calculated for all clusters within the respective region.
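The signed ranking can be sketched as below, assuming that negative ranks denote regions upstream of the gene body and positive ranks regions on the downstream side, with upstream defined relative to the gene's strand. Regions are (start, end) pairs on one chromosome; the function would be applied once per category (CpG island or non-island). All names and coordinates are illustrative.

```python
from typing import Dict, List, Tuple

Region = Tuple[int, int]   # (start, end), start < end


def rank_flanking_regions(gene_start: int, gene_end: int, strand: str,
                          regions: List[Region]) -> Dict[Region, int]:
    """Signed rank of each region by distance to the gene body.

    Negative ranks are upstream, positive ranks downstream; regions that
    overlap the gene body are skipped.
    """
    left = sorted((r for r in regions if r[1] < gene_start),
                  key=lambda r: gene_start - r[1])
    right = sorted((r for r in regions if r[0] > gene_end),
                   key=lambda r: r[0] - gene_end)
    # On the minus strand, upstream lies at higher coordinates.
    upstream, downstream = (left, right) if strand == "+" else (right, left)
    ranks: Dict[Region, int] = {}
    for i, region in enumerate(upstream, start=1):
        ranks[region] = -i
    for i, region in enumerate(downstream, start=1):
        ranks[region] = i
    return ranks


regions = [(100, 200), (700, 800), (1500, 1600), (2500, 2600), (4000, 4100)]
plus_ranks = rank_flanking_regions(1000, 2000, "+", regions)
minus_ranks = rank_flanking_regions(1000, 2000, "-", regions)
```

Averaging the interaction coefficient, or the fraction of non-zero coefficients, over all genes at each rank would then give the two metrics described above.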
Enrichment of regulatory potential
To assess the enrichment of regulatory potential, we performed an overlap analysis between CpG clusters and ChromHMM states. A linear model was fitted using all 245,511 CREs, with the Low Signal state serving as the reference baseline. In this context, the enrichment score is the difference between the average regulatory potential of clusters overlapping a ChromHMM state versus the Low Signal state. We conducted a similar analysis for transcription factor binding sites, allowing for the possibility of multiple factors binding to a single site to account for confounding effects. In this case, the basal state represents unbound chromatin, and the enrichment score is the average difference in regulatory potential of clusters bound by a specific TF versus those that are unbound.
The resolution of the H3K27ac loops was 2.5 kb. We overlapped anchors with CRE candidates and filtered out CREs that overlap promoters, defined as lying within 2000 bp of any TSS, to focus on enhancer elements. In total, we analyzed 166,552 CRE candidates grouped into 5 groups based on the number of loops: 0, 1, [2, 4), [4, 13), [13, 209). A linear regression model was fit to estimate the enrichment score.
To assess the enrichment of hubness, we repeated the same process but instead of a linear model, we fitted a logistic regression model to predict hub vs non-hub among all the regulatory elements with positive regulatory potential.
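For the loop-count analysis, the grouping and the enrichment estimate can be sketched as follows. With a single categorical predictor, the coefficients of a linear model equal the difference between each group mean and the reference group mean, which is what this sketch computes directly. Treating the zero-loop group as the reference is an assumption, as are the function names and the toy values.

```python
from bisect import bisect_right
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Left edges of the groups quoted in the text: 0, 1, [2, 4), [4, 13), [13, 209)
BIN_EDGES = [0, 1, 2, 4, 13]
BIN_LABELS = ["0", "1", "[2, 4)", "[4, 13)", "[13, 209)"]


def loop_bin(n_loops: int) -> str:
    """Assign a loop count to one of the five groups."""
    if n_loops < 0:
        raise ValueError("loop count must be non-negative")
    return BIN_LABELS[bisect_right(BIN_EDGES, n_loops) - 1]


def enrichment_by_bin(records: Iterable[Tuple[int, float]],
                      reference: str = "0") -> Dict[str, float]:
    """Mean potential of each group minus the mean of the reference group."""
    groups: Dict[str, List[float]] = defaultdict(list)
    for n_loops, potential in records:
        groups[loop_bin(n_loops)].append(potential)
    if not groups[reference]:
        raise ValueError("reference group is empty")
    base = sum(groups[reference]) / len(groups[reference])
    return {label: sum(values) / len(values) - base
            for label, values in groups.items() if label != reference}


# Made-up (number of loops, regulatory potential) pairs.
records = [(0, 0.0), (0, 0.2), (1, 0.5), (3, 1.1), (20, 2.1)]
scores = enrichment_by_bin(records)
```

For the hubness analysis, the same grouping would feed a logistic regression with hub status as the outcome instead of a difference in means.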
Survival analysis
We excluded from this analysis elements with low methylation variance (bottom 25%), and from the rest we selected all the hubs (730) and 16,190 at random. A Cox proportional hazards model on overall survival (OS) was fitted for each element using the survival R package64. The methylation status (beta) was used as a predictor and we fitted a varying-slope model, coxph(Surv(OS.time, OS) ~ beta*strata(cancer)), to estimate both the mean effect of methylation across all cancers (Fig. 4c) and the cancer-specific effect (Supplementary Fig. S4c for coefficients with adjusted p-value less than 0.05).
Analysis of promoter capture HiC
We called loops using the Arima pipeline [https://github.com/ArimaGenomics/CHiC], which is based on HiCUP65 and CHiCAGO66, with default parameters. We ran the analysis independently for the K562 and A549 cells and then used the union of loops for subsequent investigations. Quality control metrics and loop replication are shown in Supplementary Fig. S5.
The resolution of loop anchors was 5 kb, so we only considered associations of length 10 kb and above. We considered an association overlapping a loop if loop anchors overlapped with both the regulatory element and the gene. Enrichment was computed on the basis of associations: associations were grouped into bins and a logistic regression was used to compute their probability to overlap with a called loop.
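The overlap criterion can be expressed as a short sketch. Intervals are half-open (start, end) pairs on the same chromosome, the gene is represented by its captured promoter region, and association length is taken as the distance between interval midpoints; these representational choices and all coordinates are assumptions made for illustration.

```python
from typing import Tuple

Interval = Tuple[int, int]            # half-open [start, end)
Loop = Tuple[Interval, Interval]      # the two anchors of a called loop

MIN_LENGTH_BP = 10_000                # shortest association considered (text)


def overlaps(a: Interval, b: Interval) -> bool:
    """True if two half-open intervals share at least one base."""
    return a[0] < b[1] and b[0] < a[1]


def association_length(element: Interval, gene: Interval) -> int:
    """Distance between interval midpoints (an assumed definition)."""
    return abs((element[0] + element[1]) // 2 - (gene[0] + gene[1]) // 2)


def supported_by_loop(element: Interval, gene: Interval, loop: Loop) -> bool:
    """True if one anchor overlaps the element and the other the gene."""
    anchor_a, anchor_b = loop
    return ((overlaps(anchor_a, element) and overlaps(anchor_b, gene)) or
            (overlaps(anchor_b, element) and overlaps(anchor_a, gene)))


# Made-up coordinates with 5 kb anchors.
element = (100_200, 100_400)
gene = (150_000, 152_000)
loop_hit = ((100_000, 105_000), (150_000, 155_000))
loop_miss = ((100_000, 105_000), (200_000, 205_000))

eligible = association_length(element, gene) >= MIN_LENGTH_BP
hit = supported_by_loop(element, gene, loop_hit)
miss = supported_by_loop(element, gene, loop_miss)
```

The resulting binary outcome per association is what a logistic regression on binned associations would take as its response.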
To estimate the predictive power of the MethNet potential to call hubs, we first computed the number of loops anchored at each bin and then computed the potential of the bin by summing the potential of all the clusters contained in it. Finally, we called hubs for different thresholds and estimated the predictive power of the bin’s potential using the AUC of the ROC curve, computed with the pROC package.
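The evaluation can be sketched without external libraries by using the rank interpretation of the AUC: the probability that a randomly chosen hub bin has a higher potential than a randomly chosen non-hub bin, with ties counted as one half. This quadratic-time sketch stands in for the pROC computation; calling a bin a hub when its loop count reaches a threshold is an assumption, and the values are invented.

```python
from typing import List, Sequence


def call_hubs(loop_counts: Sequence[int], threshold: int) -> List[bool]:
    """Label a bin as a hub if at least `threshold` loops are anchored there."""
    return [n >= threshold for n in loop_counts]


def roc_auc(scores: Sequence[float], labels: Sequence[bool]) -> float:
    """AUC as P(score of a positive > score of a negative), ties count 0.5."""
    positives = [s for s, is_hub in zip(scores, labels) if is_hub]
    negatives = [s for s, is_hub in zip(scores, labels) if not is_hub]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up bin potentials and numbers of anchored loops.
bin_potential = [0.1, 0.4, 0.5, 0.8]
bin_loops = [0, 3, 1, 5]
auc_by_threshold = {t: roc_auc(bin_potential, call_hubs(bin_loops, t))
                    for t in (1, 3)}
```

Repeating the calculation over a range of thresholds shows how sensitive the predictive power is to the definition of a hub.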
Analysis of perturb-seq
Cell calling and protospacer calling were performed with 10x Genomics Cell Ranger 7.0.0. The results were further filtered based on the percentage of mitochondrial reads and the total number of genes, to remove lysed cells, and based on the HTO tags, to remove doublets using the hashedDrops function of the DropletUtils package67, followed by manual filtering to remove what appeared to be triplets per drop. After these filtering steps, 36,601 cells remained. The distribution of the number of expressed genes and number of reads per cell as well as the number of guides per cell are shown in Supplementary Fig. S8.
Subsequent analysis was based on the method suggested by Dixit et al.48. In particular, we focused only on genes that were within 1 Mb of a targeted region (as this is the maximum radius of MethNet association) and genes with a mean normalized expression above 0.0002. Since most cells had more than one sgRNA guide we analyzed the results for all the targets simultaneously. In particular, we identified sgRNA-gene interactions by fitting a linear model log(Y)∼Xβ, where log(Y) are the log-normalized counts (logNormCounts) and X is a binary matrix where Xij = 1 if the guide j is detected in the cell i. The models were fitted using the limma package68 with Bayesian shrinkage, and interactions were called using a threshold of 0.05 on the adjusted P-value. All recovered associations are shown in Supplementary Fig. S9. For the bootstrap analysis, we shuffled the rows of X independently for each column and refitted the model. The metric we used to quantify the performance of MethNet was the number of targeted regions forming at least a single interaction at the 0.05 threshold. Since the confidence interval is affected by the number of cells transfected with each guide, this process controls for spurious interactions due to the differences in the cell base of each guide.
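The design matrix and the shuffling step can be sketched as below. The sketch builds the binary matrix X from the set of guides detected in each cell and then permutes every column independently, which preserves the number of cells carrying each guide while breaking the link between guides and expression. The model fitting itself (limma in the study) is not reproduced, and the guide names are hypothetical.

```python
import random
from typing import List, Sequence, Set


def design_matrix(cell_guides: Sequence[Set[str]],
                  guides: Sequence[str]) -> List[List[int]]:
    """X[i][j] = 1 if guide j is detected in cell i, else 0."""
    return [[1 if g in detected else 0 for g in guides]
            for detected in cell_guides]


def shuffle_columns(X: List[List[int]], rng: random.Random) -> List[List[int]]:
    """Permute each column independently across cells.

    Column sums (cells per guide) are preserved; row sums (guides per cell)
    generally are not.
    """
    n, k = len(X), len(X[0])
    columns = []
    for j in range(k):
        column = [X[i][j] for i in range(n)]
        rng.shuffle(column)
        columns.append(column)
    return [[columns[j][i] for j in range(k)] for i in range(n)]


def column_sums(X: List[List[int]]) -> List[int]:
    """Number of cells in which each guide is detected."""
    return [sum(row[j] for row in X) for j in range(len(X[0]))]


# Toy example: four cells, three hypothetical guides.
guides = ["sg1", "sg2", "sg3"]
cells = [{"sg1"}, {"sg1", "sg2"}, {"sg3"}, set()]
X = design_matrix(cells, guides)
X_null = shuffle_columns(X, random.Random(0))
```

Refitting the model on many such shuffled matrices gives a null distribution for the number of targeted regions with at least one significant interaction.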
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Data availability
The results published here are in part based upon data generated by the TCGA Research Network [https://www.cancer.gov/tcga]. We collected paired gene expression and methylation data via the xenahubs portal69, which downloaded data from gdc.cancer.gov (data release 9.0 - October 24, 2017). We used the pan-cancer batch-corrected normalized gene expression [https://www.synapse.org/#!Synapse:syn4976369] and beta values for methylation from Illumina’s HumanMethylation450 BeadChip [https://www.synapse.org/#!Synapse:syn4557906]. Clinical data for those samples were downloaded from xenahubs where available [https://www.synapse.org/#!Synapse:syn8402823]. Metadata for the CpG probes were collected from Illumina’s annotation of the HumanMethylation450 BeadChip via the IlluminaHumanMethylation450kanno.ilmn12.hg19 Bioconductor package. The annotation was augmented (using the custom script annotate_clusters.R) by overlapping clusters with tracks from the UCSC genome browser70. We used the ChromHMM chromatin annotation, CpG island, and the transcription factor binding site cluster tracks. Links for all the data downloaded for this annotation are included in the custom script. Annotations were based on the most common labeling across all cell types. DNase data for K562 cells were downloaded from ENCODE; we used the DNase regions of the combined replicates (file id ENCFF621ZJY). Hi-ChIP loops were downloaded from the supplementary material of the FitHiChIP paper34. We used the combined loose and merged replicate (L + M) loops for 2.5 kb bins for all cell lines (CD4-Naive, GM12878, K562). The primary data for Hi-ChIP loops were generated by Mumbach MR et al.71 [https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE101498]. The manually processed data are shared at the GitHub repository of MethNet (see Code Availability) as Bhattacharyya_loops.csv.gz.
The raw and processed sequencing data generated in this study have been submitted to the Gene Expression Omnibus (GEO) database under the superfamily accession number GSE236305. The promoter capture Hi-C and Perturb-seq data accession numbers are GSE235851 and GSE236304, respectively. The results of MethNet analysis used to generate the figures are uploaded to figshare [https://doi.org/10.6084/m9.figshare.25988074.v3]. No previously published data are under restricted access. Source data are provided with this paper.
Code availability
The code used to generate the figures is available at https://github.com/TeoSakel/MethNet and archived at Zenodo [https://doi.org/10.5281/zenodo.11404065].
References
Shen, H. & Laird, P. W. Interplay between the cancer genome and epigenome. Cell 153, 38–55 (2013).
Iranzo, J., Martincorena, I. & Koonin, E. V. Cancer-mutation network and the number and specificity of driver mutations. Proc. Natl Acad. Sci. USA 115, E6010–E6019 (2018).
Yin, Y. et al. Impact of cytosine methylation on DNA binding specificities of human transcription factors. Science 356, eaaj2239 (2017).
Ahmed, M. et al. CRISPRi screens reveal a DNA methylation-mediated 3D genome dependent causal mechanism in prostate cancer. Nat. Commun. 12, 1781 (2021).
Zeng, Y. et al. DNA methylation modulated genetic variant effect on gene transcriptional regulation. Genome Biol. 24, 285 (2023).
Snetkova, V. & Skok, J. A. Enhancer talk. Epigenomics 10, 483–498 (2018).
Proudhon, C. et al. Active and inactive enhancers cooperate to exert localized and long-range control of gene regulation. Cell Rep. 15, 2159–2169 (2016).
Hewitt, S. L. et al. Association between the Igk and Igh immunoglobulin loci mediated by the 3′ Igk enhancer induces ‘decontraction’ of the Igh locus in pre–B cells. Nat. Immunol. 9, 396–404 (2008).
Medina-Rivera, A., Santiago-Algarra, D., Puthier, D. & Spicuglia, S. Widespread enhancer activity from core promoters. Trends Biochem. Sci. 43, 452–468 (2018).
Uyehara, C. M. & Apostolou, E. 3D enhancer-promoter interactions and multi-connected hubs: organizational principles and functional roles. Cell Rep. 42, 112068 (2023).
Di Giammartino, D. C., Polyzos, A. & Apostolou, E. Transcription factors: building hubs in the 3D space. Cell Cycle 19, 2395–2410 (2020).
Lim, B. & Levine, M. S. Enhancer-promoter communication: hubs or loops? Curr. Opin. Genet Dev. 67, 5–9 (2021).
Miguel-Escalada, I. et al. Human pancreatic islet three-dimensional chromatin architecture provides insights into the genetics of type 2 diabetes. Nat. Genet 51, 1137–1148 (2019).
Oudelaar, A. M. et al. Single-allele chromatin interactions identify regulatory hubs in dynamic compartmentalized domains. Nat. Genet 50, 1744–1751 (2018).
Suzuki, M. M. & Bird, A. DNA methylation landscapes: provocative insights from epigenomics. Nat. Rev. Genet 9, 465–476 (2008).
Ball, M. P. et al. Targeted and genome-scale strategies reveal gene-body methylation signatures in human cells. Nat. Biotechnol. 27, 361–368 (2009).
Neri, F. et al. Intragenic DNA methylation prevents spurious transcription initiation. Nature 543, 72–77 (2017).
Teissandier, A. & Bourc’his, D. Gene body DNA methylation conspires with H3K36me3 to preclude aberrant transcription. EMBO J. 36, 1471–1473 (2017).
Kulis, M., Queirós, A. C., Beekman, R. & Martín-Subero, J. I. Intragenic DNA methylation in transcriptional regulation, normal differentiation and cancer. Biochimica et. Biophysica Acta (BBA) - Gene Regulatory Mechanisms 1829, 1161–1174 (2013).
Luo, Y. et al. New developments on the encyclopedia of DNA elements (ENCODE) data portal. Nucleic Acids Res. 48, D882–D889 (2020).
Silva, T. C. et al. ELMER v.2: an R/Bioconductor package to reconstruct gene regulatory networks from DNA methylation and transcriptome profiles. Bioinformatics 35, 1974–1977 (2019).
Yao, L., Shen, H., Laird, P. W., Farnham, P. J. & Berman, B. P. Inferring regulatory element landscapes and transcription factor networks from cancer methylomes. Genome Biol. 16, 105 (2015).
Rhie, S. K. et al. Identification of activated enhancers and linked transcription factors in breast, prostate, and kidney tumors by tracing enhancer networks using epigenetic traits. Epigenetics Chromatin 9, 50 (2016).
Schlosberg, C. E., VanderKraats, N. D. & Edwards, J. R. Modeling complex patterns of differential DNA methylation that associate with gene expression changes. Nucleic Acids Res. 45, 5100–5111 (2017).
Li, J., Ching, T., Huang, S. & Garmire, L. X. Using epigenomics data to predict gene expression in lung cancer. BMC Bioinforma. 16, S10 (2015).
Klett, H. et al. Robust prediction of gene regulation in colorectal cancer tissues from DNA methylation profiles. Epigenetics 13, 386–397 (2018).
Dixon, J. R. et al. Topological domains in mammalian genomes identified by analysis of chromatin interactions. Nature 485, 376–380 (2012).
Nora, E. P. et al. Spatial partitioning of the regulatory landscape of the X-inactivation centre. Nature 485, 381–385 (2012).
Sexton, T. et al. Three-dimensional folding and functional organization principles of the drosophila genome. Cell 148, 458–472 (2012).
Davidson, I. F. et al. DNA loop extrusion by human cohesin. Science 366, 1338–1345 (2019).
Deaton, A. M. & Bird, A. CpG islands and the regulation of transcription. Genes Dev. 25, 1010–1022 (2011).
Ernst, J. & Kellis, M. Chromatin-state discovery and genome annotation with ChromHMM. Nat. Protoc. 12, 2478–2492 (2017).
Gebhard, C. et al. General transcription factor binding at CpG islands in normal cells correlates with resistance to de novo DNA methylation in cancer cells. Cancer Res. 70, 1398–1407 (2010).
Héberlé, É. & Bardet, A. F. Sensitivity of transcription factors to DNA methylation. Essays Biochem 63, 727–741 (2019).
Onuchic, V. et al. Allele-specific epigenome maps reveal sequence-dependent stochastic switching at regulatory loci. Science 361, eaar3146 (2018).
Bhattacharyya, S., Chandra, V., Vijayanand, P. & Ay, F. Identification of significant chromatin contacts from HiChIP data by FitHiChIP. Nat. Commun. 10, 4221 (2019).
Krijger, P. H. L. & de Laat, W. Regulation of disease-associated gene expression in the 3D genome. Nat. Rev. Mol. Cell Biol. 17, 771–782 (2016).
Ortabozkoyun, H. et al. CRISPR and biochemical screens identify MAZ as a cofactor in CTCF-mediated insulation at Hox clusters. Nat. Genet. 54, 202–212 (2022).
Canzio, D. & Maniatis, T. The generation of a protocadherin cell-surface recognition code for neural circuit assembly. Curr. Opin. Neurobiol. 59, 213–220 (2019).
Guo, Y. et al. CTCF/cohesin-mediated DNA looping is required for protocadherin α promoter choice. Proc. Natl Acad. Sci. 109, 21081–21086 (2012).
Rajarajan, P. et al. Neuron-specific signatures in the chromosomal connectome associated with schizophrenia risk. Science 362, eaat4311 (2018).
Jiang, T. et al. Identification of multi-loci hubs from 4C-seq demonstrates the functional importance of simultaneous interactions. Nucleic Acids Res. 44, 8714–8725 (2016).
Allahyar, A. et al. Enhancer hubs and loop collisions identified from single-allele topologies. Nat. Genet 50, 1151–1160 (2018).
Yeo, N. C. et al. An enhanced CRISPR repressor for targeted mammalian gene regulation. Nat. Methods 15, 611–616 (2018).
Klann, T. S. et al. CRISPR–Cas9 epigenome editing enables high-throughput screening for functional regulatory elements in the human genome. Nat. Biotechnol. 35, 561–568 (2017).
Doench, J. G. et al. Optimized sgRNA design to maximize activity and minimize off-target effects of CRISPR-Cas9. Nat. Biotechnol. 34, 184–191 (2016).
Sanson, K. R. et al. Optimized libraries for CRISPR-Cas9 genetic screens with multiple modalities. Nat. Commun. 9, 5416 (2018).
Dixit, A. et al. Perturb-seq: dissecting molecular circuits with scalable single-cell RNA profiling of pooled genetic screens. Cell 167, 1853–1866.e17 (2016).
Wang, T., Wang, Z., Niu, R. & Wang, L. Crucial role of Anxa2 in cancer progression: highlights on its novel regulatory mechanism. Cancer Biol. Med. 16, 671–687 (2019).
Li, Q. et al. Downregulation of N-acetylglucosaminyltransferase GCNT3 by miR-302b-3p decreases non-small cell lung cancer (NSCLC) cell proliferation, migration and invasion. Cell Physiol. Biochem. 50, 987–1004 (2018).
Park, H. et al. AMIGO2, a novel membrane anchor of PDK1, controls cell survival and angiogenesis via Akt activation. J. Cell Biol. 211, 619–637 (2015).
Izutsu, R. et al. AMIGO2 contained in cancer cell-derived extracellular vesicles enhances the adhesion of liver endothelial cells to cancer cells. Sci. Rep. 12, 792 (2022).
Lee, Y.-Y. et al. Loss of tumor suppressor IGFBP4 drives epigenetic reprogramming in hepatic carcinogenesis. Nucleic Acids Res. 46, 8832–8847 (2018).
Loyfer, N. et al. A DNA methylation atlas of normal human cell types. Nature 613, 355–364 (2023).
Chakravarthy, A. et al. Pan-cancer deconvolution of tumour composition using DNA methylation. Nat. Commun. 9, 3220 (2018).
Chen, B., Khodadoust, M. S., Liu, C. L., Newman, A. M. & Alizadeh, A. A. Profiling tumor infiltrating immune cells with CIBERSORT. Methods Mol. Biol. 1711, 243–259 (2018).
Buniello, A. et al. The NHGRI-EBI GWAS catalog of published genome-wide association studies, targeted arrays and summary statistics 2019. Nucleic Acids Res. 47, D1005–D1012 (2019).
Rheinbay, E. et al. Analyses of non-coding somatic drivers in 2,658 cancer whole genomes. Nature 578, 102–111 (2020).
Khurana, E. et al. Role of non-coding sequence variants in cancer. Nat. Rev. Genet 17, 93–108 (2016).
Pott, S. & Lieb, J. D. What are super-enhancers? Nat. Genet 47, 8–12 (2015).
Hnisz, D. et al. Super-enhancers in the control of cell identity and disease. Cell 155, 934–947 (2013).
Frankish, A. et al. GENCODE reference annotation for the human and mouse genomes. Nucleic Acids Res. 47, D766–D773 (2019).
Tay, J. K., Narasimhan, B. & Hastie, T. Elastic net regularization paths for all generalized linear models. J. Stat. Softw. 106, 1–31 (2023).
Therneau, T. M., Lumley, T., Atkinson, E. & Crowson, C. survival: Survival Analysis. R package (2023).
Wingett, S. et al. HiCUP: pipeline for mapping and processing Hi-C data. F1000Res 4, 1310 (2015).
Cairns, J., Pritchett, P. F., Wingett, S. & Spivakov, M. Chicago: CHiCAGO: Capture Hi-C Analysis of Genomic Organization. Bioconductor version: Release (3.17) https://doi.org/10.18129/B9.bioc.Chicago (2023).
Lun, A. T. L. et al. EmptyDrops: distinguishing cells from empty droplets in droplet-based single-cell RNA sequencing data. Genome Biol. 20, 63 (2019).
Ritchie, M. E. et al. Limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 43, e47 (2015).
Goldman, M. J. et al. Visualizing and interpreting cancer genomics data via the Xena platform. Nat. Biotechnol. 38, 675–678 (2020).
Raney, B. J. et al. The UCSC genome browser database: 2024 update. Nucleic Acids Res. 52, D1082–D1088 (2024).
Mumbach, M. R. et al. Enhancer connectome in primary human cells identifies target genes of disease-associated DNA elements. Nat. Genet. 49, 1602–1612 (2017).
Acknowledgements
These studies were supported by grants P01CA229086 (JAS, AT, AH) and 2R35GM122515 (JAS). GC and GJ were supported by fellowships from the NCC.
Author information
Contributions
These studies were designed by Theodore Sakellaropoulos, Jane A. Skok, Aristotelis Tsirigos and Catherine Do. All analyses were performed by Theodore Sakellaropoulos. The Perturb-seq experiment was performed by Guimei Jiang; the promoter-capture Hi-C was performed by Giulia Cova, Sitharam Ramaswami and Dacia Dimartino, supervised by Adriana Heguy. scRNA-seq for the Perturb-seq experiment was performed by Peter Meyn. The paper was written by Theodore Sakellaropoulos, Catherine Do and Jane A. Skok.
Ethics declarations
Competing interests
Aristotelis Tsirigos is a scientific advisor to Intelligencia AI. The rest of the authors declare no competing interests.
Peer review
Peer review information
Nature Communications thanks Gong-Hong Wei, and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. A peer review file is available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Sakellaropoulos, T., Do, C., Jiang, G. et al. MethNet: a robust approach to identify regulatory hubs and their distal targets from cancer data. Nat Commun 15, 6027 (2024). https://doi.org/10.1038/s41467-024-50380-3
DOI: https://doi.org/10.1038/s41467-024-50380-3