Abstract
Unveiling gene interactions is crucial for comprehending biological processes, particularly their combined impact on phenotypes. Computational methodologies for gene interaction discovery have been extensively studied, but their application to censored data has yet to be thoroughly explored. Our work introduces a data-driven approach to identifying gene interactions that profoundly influence survival rates through the use of survival analysis. Our approach calculates the restricted mean survival time (RMST) for gene pairs and compares it against their individual expressions. If the interaction’s RMST exceeds that of the individual gene expressions, it suggests a potential functional association. We focused on L1000 landmark genes using the TCGA and METABRIC data sets. Our findings demonstrate numerous additive and competing interactions and a scarcity of XOR-type interactions. We substantiated our results by cross-referencing with existing interactions in the STRING and BioGRID databases and by using large language models to summarize complex biological data. Although many potential gene interactions were hypothesized, only a fraction have been experimentally explored. This novel approach enables biologists to initiate further investigation based on our ranked gene pairs and the generated literature summaries, thus offering a comprehensive, data-driven approach to understanding gene interactions affecting survival rates.
Supported by the Slovenian Research Agency grants P2-0209 and L2-3170.
Keywords
- survival analysis
- censored data
- RMST
- gene expression
- gene interactions
- literature mining
- large language models
1 Introduction
Survival analysis is a set of statistical methods used to study the time until an event of interest occurs and is commonly used in medical research to estimate life expectancy based on patient-specific data [21]. A pivotal aspect of survival analysis is estimating survival curves and comparing the probability of survival over time between different cohorts [4]. In biomedicine, we can relate the differences in survival to potential markers such as specific genes [2] or groups of genes [22], which can help distinguish patients who respond to treatments from those who do not (see Fig. 1) [27].
Rather than a single gene, intricate networks of gene interactions determine the complex nature of diseases such as cancer [26]. Identifying and characterizing these interactions is essential, as they offer critical insights into the onset and progression of a disease that may be overlooked when analyzing individual genes. Computational discovery of gene interactions is a well-researched area in genome-wide association studies (GWAS) [17] and gene expression-based phenotype categorization [5]. For the former, a notable approach for handling survival data is the adaptation of multifactor dimensionality reduction (MDR) [6, 15]. Authors also typically utilize Cox regression analysis to analyze the interaction effects of candidate genes [28, 30]. Analyzing survival data is crucial in the clinical domain, highlighting the need for more systematic, data-driven methodologies to unravel intricate gene interactions linked to survival data. Computational methods that specifically address gene interactions from survival data are, at best, scarce, and due to the recent abundance of survival data that includes gene expression, there is a need for their development.
Here we report on a data-driven approach for identifying gene interactions significantly affecting survival rates. In the context of our study, gene interaction refers to the combined effect of two genes on survival, which may be substantially different from their individual effects. Our method aims to measure this interaction effect, quantified as the difference in restricted mean survival time (RMST) when considering the expression of both genes together compared to the expression of individual genes. We then rank gene pairs based on the significance of the difference in RMST. We use top-ranked gene pairs, cross-reference our findings with documented interactions, and synthesize complex literature findings using large language models, thus expanding the exploratory scope of our study.
In Sect. 2, we (1) introduce the data, (2) describe how we measured the effect on survival, (3) explain the measure of interaction and how we define different types of interactions, and (4) describe the use of large language models when cross-referencing our findings with existing literature. Section 3 briefly describes our analysis findings, followed by a discussion of limitations and possible future work in Sect. 4.
2 Methods
Our method focuses on two-gene interactions and unfolds through a four-step process. First, we separate samples into evenly sized groups according to the median gene expression value. Subsequently, we estimate survival curves for each group, and for each survival curve, we compute the restricted mean survival time (RMST). We then quantify the difference in RMST between the groups. Lastly, we assess the interaction effect by evaluating how much the RMST difference of the interaction term (discussed in Sect. 2.4) differs from that of the participating genes. We replicate this procedure for each gene pair in our data set during our discovery-driven analysis and rank the pairs based on their interaction effect. This ranked list paves the way for biologists to initiate their interpretation and investigation. To aid this process, we implement literature mining and harness the utility of large language models to distill complex biological knowledge for assistance and interpretation.
2.1 Data
In this study, we leveraged two sources of survival data:
- TCGA. We procured RNA-Seq data, including gene expression matrices and corresponding survival endpoints, for various cancer types from The Cancer Genome Atlas via the GEO portal (GSE62944) [16]. Given the variability in sample size across different datasets, we included only those with more than 100 samples, resulting in 20 TCGA datasets.
- METABRIC. We obtained the microarray gene expression matrix and patient survival data from The Molecular Taxonomy of Breast Cancer International Consortium through cBioPortal [3].
Across all datasets in our study, we applied a log transformation to each gene expression value after adding a pseudocount of 1. Additionally, z-score normalization was carried out on each gene across samples within a dataset, essentially standardizing the columns of the expression matrix. We utilized clinical metadata for each sample’s overall survival (OS) time and event status. OS time refers to the most recent date a patient was confirmed alive. The event is recorded when a patient dies due to the condition under study, in this case, cancer. If a patient’s status is unknown or death occurs due to unrelated causes, we classify the event status as censored. Note that sample sizes and event rates vary across datasets (see Table 1).
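The transformation described above can be sketched in a few lines. This is a minimal illustration rather than the authors' code: the function name `log_zscore`, the base-2 logarithm, and the use of the population standard deviation are assumptions, since the text does not specify them.

```python
import math
import statistics


def log_zscore(matrix):
    """Log-transform with a pseudocount of 1, then z-score each gene (column).

    `matrix` is a list of samples (rows); each row lists one non-negative
    expression value per gene. Base 2 and the population standard deviation
    are assumptions made for this sketch.
    """
    logged = [[math.log2(value + 1.0) for value in row] for row in matrix]
    n_genes = len(logged[0])
    scaled = [[0.0] * n_genes for _ in logged]
    for j in range(n_genes):
        column = [row[j] for row in logged]
        mean = statistics.fmean(column)
        sd = statistics.pstdev(column)
        for i, value in enumerate(column):
            # A constant gene carries no information; map it to zero.
            scaled[i][j] = (value - mean) / sd if sd > 0 else 0.0
    return scaled
```

After this step every gene has zero mean and unit variance within a dataset, which makes the sums, differences, and products of two genes used later comparable across gene pairs.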
To limit our exploration scope, we have focused solely on a specific set of genes referred to as L1000 genes [23]. The L1000 gene set contains roughly one thousand landmark genes acting as proxies to infer the expression of other genes. Using this curated set of landmark genes, we significantly reduced the dimensionality of our search space to a set of 1058 genes. Additionally, we removed genes with low expression values to reduce noise before we proceeded with computation. We have disregarded genes with a 75th percentile expression value lower than 10.
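A possible form of the low-expression filter is sketched below. The helper name `keep_expressed_genes` and the toy gene labels are hypothetical, and the sketch assumes the threshold of 10 applies to untransformed expression values, which the text does not state explicitly.

```python
import statistics


def keep_expressed_genes(matrix, gene_names, min_q3=10.0):
    """Return names of genes whose 75th percentile expression is at least `min_q3`.

    `matrix` holds one row per sample and one column per gene. Applying the
    cut to untransformed values is an assumption of this sketch.
    """
    kept = []
    for j, name in enumerate(gene_names):
        column = [row[j] for row in matrix]
        q3 = statistics.quantiles(column, n=4, method="inclusive")[2]
        if q3 >= min_q3:
            kept.append(name)
    return kept
```

For example, a gene with values 0, 5, 20, and 40 across four samples has an upper quartile of 25 and is kept, while a gene with values 1, 2, 3, and 4 is dropped.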
2.2 Summary Measure of Survival: Restricted Mean Survival Time
Restricted Mean Survival Time (RMST) is the average survival time up to a pre-specified time point, quantified as the area under the survival curve up to that point (see Fig. 2) [29]. Its primary benefits are that it is interpretable, provides a meaningful summary of survival data, and is considered more robust than measures of median survival time [7].
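To make the definition concrete, the following sketch computes a Kaplan-Meier curve and the area under it up to a time limit. The function names and the argument `tau` (the pre-specified time point) are chosen here for illustration; in practice an established survival analysis library would normally be used.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time (Kaplan-Meier estimate).

    `events[i]` is truthy if the event was observed for sample i and falsy
    if the sample was censored at `times[i]`.
    """
    survival, curve = 1.0, []
    for t in sorted({t for t, e in zip(times, events) if e}):
        at_risk = sum(1 for x in times if x >= t)
        deaths = sum(1 for x, e in zip(times, events) if x == t and e)
        survival *= 1.0 - deaths / at_risk
        curve.append((t, survival))
    return curve


def rmst(times, events, tau):
    """Area under the Kaplan-Meier step function on the interval [0, tau]."""
    area, prev_t, prev_s = 0.0, 0.0, 1.0
    for t, s in kaplan_meier(times, events):
        if t >= tau:
            break
        area += prev_s * (t - prev_t)  # rectangle up to the next drop
        prev_t, prev_s = t, s
    return area + prev_s * (tau - prev_t)  # last rectangle, up to tau
```

With no censoring and `tau` at the largest observed time, the result reduces to the ordinary mean survival time, which is a quick sanity check of the implementation.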
Building upon its intuitive nature, RMST has gained substantial traction for its versatile utility in comparing differences in survival between cohorts [19]. The difference in RMST is an alternative means to measure gains or losses in the event-free survival between different groups of patients (see Fig. 3). Unlike the log-rank test, which heavily relies on the assumption of proportional hazards and may be sensitive to instances of crossing survival curves, the difference in RMST presents a more flexible and reliable approach [25].
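For reference, both quantities can be written compactly. The symbols \(S(t)\) (the estimated survival function), \(\tau\) (the pre-specified time limit), and the cohort subscripts are notation introduced here for illustration, and the absolute value follows the absolute difference used in Sect. 2.3:

\[
\mathrm{RMST}(\tau) = \int_0^{\tau} S(t)\,dt, \qquad
\Delta\mathrm{RMST}(\tau) = \left| \mathrm{RMST}_{\text{high}}(\tau) - \mathrm{RMST}_{\text{low}}(\tau) \right|
= \left| \int_0^{\tau} \bigl( S_{\text{high}}(t) - S_{\text{low}}(t) \bigr)\,dt \right| .
\]

When the two curves do not cross on \([0, \tau]\), this value equals the area between them. If the curves cross, regions of opposite sign partially cancel, so the difference is smaller than the enclosed area.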
2.3 Interaction Scoring
We have devised a data-driven approach to identify interactions that reveal significant RMST differences. Such a difference implies that the combined influence of both features on survival differs considerably from the individual influence of each feature. While this technique broadly applies to various types of data, our primary focus here is on gene expression data, which we use to determine the combined influence of gene pairs on survival outcomes compared to their individual effects.
The steps, summarized in Algorithm 1, are as follows:
1. First, we partition samples into two cohorts based on the median expression value of a particular gene. Each cohort represents a group of patients with either low or high gene expression values (line 2).
2. For each cohort, we calculate its Kaplan-Meier survival curve (line 3). Next, we compute the RMST for each survival curve (line 4). We limit RMST computation to the 75th percentile of all survival times in the cohort to circumvent potential issues arising from uncertainty in survival estimates of long survivors and to ensure a fair comparison across different cohorts by consistently applying the same upper bound.
3. We calculate the absolute difference in RMST between the two created cohorts (line 5). This difference effectively represents the area between the survival curves, providing a measure of the disparity in survival outcomes between the two groups (as shown in Fig. 3).
4. To determine whether an interaction effect exists, we first calculate the RMST differences for the individual genes and their interaction (lines 6-8). We then compute the interaction measure as the absolute difference between the largest individual RMST difference and the difference in RMST for the interaction term (line 10).
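The four steps can be sketched end to end as follows. This is an illustrative reading of the description, not the published implementation: all function names and the `combine` argument are hypothetical, and the sketch assumes that the time limit is the 75th percentile of the survival times of both cohorts taken together, so that the same upper bound applies to each.

```python
import statistics


def rmst(times, events, tau):
    """Area under the Kaplan-Meier step curve on [0, tau]."""
    area, prev_t, survival = 0.0, 0.0, 1.0
    for t in sorted({t for t, e in zip(times, events) if e and t < tau}):
        at_risk = sum(1 for x in times if x >= t)
        deaths = sum(1 for x, e in zip(times, events) if x == t and e)
        area += survival * (t - prev_t)
        survival *= 1.0 - deaths / at_risk
        prev_t = t
    return area + survival * (tau - prev_t)


def rmst_difference(feature, times, events):
    """Steps 1-3: median split on `feature`, then |RMST_high - RMST_low|."""
    cut = statistics.median(feature)
    # Assumption: one shared time limit for both cohorts.
    tau = statistics.quantiles(times, n=4, method="inclusive")[2]
    groups = {True: ([], []), False: ([], [])}
    for f, t, e in zip(feature, times, events):
        groups[f > cut][0].append(t)
        groups[f > cut][1].append(e)
    high, low = groups[True], groups[False]
    return abs(rmst(*high, tau) - rmst(*low, tau))


def interaction_score(gene_a, gene_b, times, events, combine):
    """Step 4: |interaction RMST difference - largest individual difference|."""
    d_a = rmst_difference(gene_a, times, events)
    d_b = rmst_difference(gene_b, times, events)
    joint = [combine(a, b) for a, b in zip(gene_a, gene_b)]
    d_ab = rmst_difference(joint, times, events)
    return abs(d_ab - max(d_a, d_b))
```

Passing `lambda a, b: a * b` as `combine` on a toy data set in which neither gene separates survival on its own, but their product does, yields a large score, which is the pattern the ranking is designed to surface.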
2.4 Interaction Types
We define three types of interactions between genes that correspond to different cohort formations (see Fig. 4). Using standardized gene expression values of two genes, we construct a new feature and create cohorts using the approach mentioned earlier. Gene interactions measured with this approach should be interpreted with respect to survival and not as physical interactions.
The first interaction is an additive (+) interaction, where standardized gene expression values of both genes are summed together. Such interactions are more common for genes of protein complex subunits.
The second interaction is a competing (-) interaction, where standardized gene expression values are subtracted. The cohorts represent which of the two genes was more expressed. Such interactions are more common for activator and inhibitor-type interactions, where both genes regulate the same process.
The last interaction is an XOR-type (\(\times \)) interaction, where we multiply standardized gene expression values. These interactions are more complex and are scarce in nature. They may result from the alternative signaling pathways to the same process influencing survival.
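The three constructions differ only in how the two standardized expression vectors are combined before the median split. A minimal sketch follows, in which the dictionary keys and the function name `interaction_cohorts` are labels chosen here for illustration:

```python
import statistics

# How two standardized expression values are combined for each interaction type.
INTERACTIONS = {
    "additive": lambda a, b: a + b,   # both genes push in the same direction
    "competing": lambda a, b: a - b,  # which gene is relatively more expressed
    "xor": lambda a, b: a * b,        # same-sign versus opposite-sign expression
}


def interaction_cohorts(z_a, z_b, kind):
    """Combine two standardized genes and split samples at the median."""
    feature = [INTERACTIONS[kind](a, b) for a, b in zip(z_a, z_b)]
    cut = statistics.median(feature)
    return ["high" if f > cut else "low" for f in feature]
```

For the product, the "high" cohort collects samples in which both genes lie on the same side of their mean, and the "low" cohort collects samples in which they lie on opposite sides, which is why this construction behaves like an XOR.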
2.5 Discovering False Positives with Permutation Test
To identify potential false positive interactions, we performed a permutation test for every data set and interaction type, which involved random shuffling of the survival endpoint and rerunning the experiment 100 times. Given that we conducted 100 such permutation runs per data set and different interaction types, the computation required was extensive due to the sheer volume of potential combinations to examine. Our analysis yielded results that allowed us to isolate the top 0.01% interactions, deemed non-random occurrences. In essence, we consider interactions exceeding the 99.99th percentile as potential interaction hits.
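One way to derive such a threshold is sketched below. The text does not state how the permutation scores were aggregated, so pooling all scores across runs and reading off a percentile is an assumption, as are the function name and the `score_fn` callback, which stands in for scoring every gene pair against a given survival endpoint.

```python
import random


def permutation_threshold(score_fn, times, events, n_perm=100, pct=99.99, seed=0):
    """Percentile of interaction scores obtained under shuffled survival endpoints.

    `score_fn(times, events)` is a hypothetical callback returning the scores
    of all tested gene pairs. Time and event status are shuffled together so
    that each (time, status) pair stays intact.
    """
    rng = random.Random(seed)
    index = list(range(len(times)))
    null_scores = []
    for _ in range(n_perm):
        rng.shuffle(index)
        perm_times = [times[i] for i in index]
        perm_events = [events[i] for i in index]
        null_scores.extend(score_fn(perm_times, perm_events))
    null_scores.sort()
    # Index of the requested percentile, clamped to the last element.
    k = min(len(null_scores) - 1, int(len(null_scores) * pct / 100.0))
    return null_scores[k]
```

Scores from the unshuffled data that exceed the returned value would then be treated as potential interaction hits.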
2.6 Literature Mining
We propose to use literature mining to, where possible, explain the interactions and synthesize intricate biological knowledge, leveraging the power of large language models. Specifically, we have used GPT-3.5 and GPT-4 developed by OpenAI. We focused on each data set’s top 100 ranked gene interaction pairs and interaction types. These were cross-referenced within STRING [24] and BioGRID [14] databases to ascertain how many gene pairs are in those intricate networks of interactions. We also determined the number of shortest paths and the shortest path length between gene pairs within the BioGRID interaction network. We also incorporated UniProt descriptions of all genes under investigation to supplement our analysis [1].
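The two network statistics mentioned above, the shortest path length and the number of shortest paths between two genes, can be obtained with a single breadth-first search. The sketch below works on a plain edge list; the function name is illustrative, and loading the actual BioGRID network is outside its scope.

```python
from collections import deque


def shortest_paths(edges, source, target):
    """Return (length, count) of shortest paths in an undirected graph.

    `edges` is an iterable of (node, node) pairs. Returns (None, 0) when
    the two nodes are not connected or not present in the network.
    """
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    if source not in adjacency or target not in adjacency:
        return None, 0
    distance, count = {source: 0}, {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adjacency[u]:
            if v not in distance:
                # First visit: v lies one step beyond u.
                distance[v] = distance[u] + 1
                count[v] = count[u]
                queue.append(v)
            elif distance[v] == distance[u] + 1:
                # Another equally short route into v.
                count[v] += count[u]
    return distance.get(target), count.get(target, 0)
```

A length of 1 corresponds to a directly documented interaction, while longer paths identify intermediate genes that can be supplied to the language model as additional context.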
Having performed initial analyses, we then concentrated on the top 10 ranked gene pairs and interactions previously reported in the literature. Utilizing the language models, we sought to condense the complex biological context, prompting the models to extrapolate potential functional associations between these genes. The UniProt functional descriptions of gene pairs and some genes found in the shortest path within the BioGRID interaction network informed the models’ prompts.
3 Results
With our proposed approach we performed the analysis on TCGA and METABRIC datasets.
3.1 Analysis Reveals Potential Interactions
We overlay interaction hits, as described in Sect. 2.5, with permutation test results. The average number of permutation interactions above the threshold was 55.9 in every case, equivalent to 0.01% of all tested interactions. The tail of the distribution above the 99.99th percentile of interactions is also visualized (see Fig. 5).
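The value of 55.9 is consistent with simple counting, assuming that all 1058 landmark genes enter the pairwise search (an assumption made here; the low-expression filter of Sect. 2.1 could reduce this number for a given data set):

\[
\binom{1058}{2} = \frac{1058 \cdot 1057}{2} = 559{,}153, \qquad 0.0001 \times 559{,}153 \approx 55.9 .
\]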
The number of additive and competing interaction hits overwhelmingly exceeded the 56 random interaction threshold for almost all data sets (Table 2). The number of additive interactions is generally lower than the number of competing interactions for the same data set. On the other hand, XOR-type interactions are scarce and found in abundance only in one out of 21 data sets tested. Interestingly, there was no correlation between the number of interaction hits and samples or events in the data set.
3.2 Cross-Referencing with Established Interaction Networks
We have cross-referenced the top 100 ranked gene interactions against known gene interaction networks in STRING and BioGRID. Our findings indicate that many of these interactions have some form of confirmation in these referenced databases. Additionally, we performed these steps using randomly selected pairs of genes instead of our top-ranked list and repeated this random sampling process a thousand times. As illustrated in Fig. 6, competing interactions from HNSC and KIRC emerge as interesting outliers. On average, the top additive and XOR interactions are more scarce in the databases than competing interactions.
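The random baseline can be sketched as repeated sampling of gene pairs followed by a lookup in the set of documented interactions. The function and argument names are illustrative, and the sketch assumes that pairs are drawn uniformly and without repetition within a round.

```python
import random


def random_pair_baseline(genes, known_pairs, k=100, n_rounds=1000, seed=0):
    """Count documented pairs among `k` random gene pairs, `n_rounds` times.

    `genes` is a list of gene names and `known_pairs` an iterable of
    2-tuples taken from an interaction database. `k` must not exceed the
    number of distinct pairs that can be formed from `genes`.
    """
    rng = random.Random(seed)
    known = {frozenset(pair) for pair in known_pairs}
    counts = []
    for _ in range(n_rounds):
        drawn = set()
        while len(drawn) < k:
            # Unordered pair of two distinct genes; duplicates are ignored.
            drawn.add(frozenset(rng.sample(genes, 2)))
        counts.append(sum(1 for pair in drawn if pair in known))
    return counts
```

Comparing the number of documented pairs in a top-100 list against the distribution of these counts indicates whether the overlap exceeds what would be expected by chance.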
Given the surprisingly high number of documented interactions, even among randomly selected gene pairs, we hypothesize that this is because we are dealing with well-established genes, which increases the likelihood of their documentation in high-throughput analyses. These analyses typically investigate thousands of genes simultaneously, and their results are then reported in databases like BioGRID.
3.3 Case Study: RHOA-CD44 Competing Interaction
We present one of the top three competing-type gene interaction hits from the kidney renal clear cell carcinoma (TCGA-KIRC) data set, with a confirmed interaction in both the STRING and BioGRID databases (see Fig. 7a). The competing interaction between the RHOA and CD44 genes shows a difference between cohorts that is more than five months larger than that of either gene individually (see Fig. 7b).
The CD44 gene produces a cell surface receptor that binds hyaluronan (HA) and is involved in cell-cell interactions, adhesion, and migration. It serves for signal transduction to different pathways, including cytoskeleton reorganization via the RhoA small GTPase [8]. Overexpression of CD44 was related to poor prognosis in glioblastomas [20] and renal cell carcinoma [12] but had no significant effect on breast cancer patient survival [18].
The RhoA gene produces a small GTPase, which functions as a molecular switch mainly in cytoskeleton dynamics and cell migration [10]. Increased RhoA-ROCK activities mediate the upregulation of tumor suppressor p53 and induce G1 cell cycle arrest in kidney cell lines [13]. It has been shown that reduced RhoA expression enhances metastasis in breast cancer [9].
Observing Kaplan-Meier plots for both genes’ high and low expression cohorts confirms findings from the literature (see Fig. 7c,d). Our method reveals a competing interaction between the CD44 and RhoA genes. We interpret this as a competition between CD44 and RhoA-related biology, where the more highly expressed gene prevails. Note that we are comparing relative expressions according to the mean expression in the data set (see Fig. 7e). When RhoA is highly expressed, it inhibits the tumor suppression mechanism. Only when CD44 is more expressed than RhoA does it sufficiently activate downstream pathways to have a significant effect on survival over the effect of the RhoA gene (see Fig. 7f).
4 Discussion
Our results suggest a novel ability to identify interactions significantly affecting survival outcomes, thus unveiling insights into the complex landscape of gene interplay and disease prognosis. Even so, our methodology’s ranked gene interaction lists should be interpreted cautiously, serving primarily as an exploratory analysis. Due to the vastness of possible gene interactions, we expect some to arise purely by chance. Our preliminary work with permutation tests and literature mining provides only some supporting evidence against such chance findings. Our analysis identified several potential gene interactions affecting patient survival rates, providing a basis for further in-depth investigations. Particularly noteworthy is the abundance of XOR-type interactions in the HNSC dataset.
Our study also reveals an intriguing potential for large language models to summarize complex biological knowledge when fed with adequate context. By distilling intricate gene pair interactions and their associated functions as informed by resources like UniProt and interaction network databases, the models demonstrated their capacity to reason about known interactions, speculate on potential associations, and guide future exploratory directions (as illustrated with an example in Fig. 8). Although the present analysis should not be regarded as a definitive evaluation of interaction, it establishes an efficient pipeline to facilitate knowledge synthesis and accelerate the pace of scientific discovery, as demonstrated in the case study above.
We also recognized noticeable differences in the quality of summaries generated by GPT-3.5 and GPT-4, indicating a trend of improved comprehension and representation of complex biological interactions with newer model iterations. This observation suggests a promising area for future research: the potential of customized language models, fine-tuned on recent, domain-specific literature, which could serve as a more streamlined and context-aware alternative to the vast, generalized models currently accessed via APIs.
While our study presents interesting insights, several limitations present opportunities for future exploration and refinement. The choice of equally-sized cohorts, achieved by splitting at the median, does not account for potential variations in the cohort splits that might optimize the difference in RMST between cohorts. Additionally, we did not consider the potential influence of time limits on RMST calculations, which could significantly impact results and can be very study specific. Lastly, our analysis was constrained by a low number of samples relative to the vast space of possible feature interactions. The enormous space of potential feature interactions may limit the generalizability of our findings. Future work is required to address these limitations and deepen the insights offered by our proposed methodology.
5 Conclusions
The prevalent nature of censored data and molecular fingerprints in clinical environments highlights the need for techniques to illuminate the biological processes regulating disease progression. Unraveling gene interactions is fundamental in understanding these processes, specifically their collective effects on phenotypes.
We report on our work to introduce a data-centric method for detecting gene interactions significantly affecting survival rates, leveraging restricted mean survival times. Using the proposed approach, we can identify possible novel gene interaction candidates on publicly available datasets. We further contextualize the hypothesized gene interactions through literature mining and using large language models to distill complex biological knowledge for assistance and interpretation. In a case study, we show the applicability of such an approach and its potential to uncover and explain potential new interactions.
We have made our method’s implementation and the accompanying data and scripts available on GitHub and archived them on Zenodo [11]. These resources include the extended results of permutation tests, summaries produced by the language models, and the prompt used to generate them.
References
1. UniProt: the universal protein knowledgebase in 2023. Nucleic Acids Res. 51(D1), D523–D531 (2023)
2. Beer, D.G., et al.: Gene-expression profiles predict survival of patients with lung adenocarcinoma. Nat. Med. 8(8), 816–824 (2002)
3. Curtis, C., et al.: The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups. Nature 486(7403), 346–352 (2012)
4. Dey, T., Mukherjee, A., Chakraborty, S.: A practical overview and reporting strategies for statistical analysis of survival studies. Chest 158(1), S39–S48 (2020)
5. Evans, L.M., et al.: Transcriptome-wide gene-gene interaction associations elucidate pathways and functional enrichment of complex traits. PLoS Genet. 19(5), e1010693 (2023)
6. Gui, J., Moore, J.H., Kelsey, K.T., Marsit, C.J., Karagas, M.R., Andrew, A.S.: A novel survival multifactor dimensionality reduction method for detecting gene-gene interactions with application to bladder cancer prognosis. Hum. Genet. 129, 101–110 (2011)
7. Han, K., Jung, I.: Restricted mean survival time for survival analysis: a quick guide for clinical researchers. Korean J. Radiol. 23(5), 495 (2022)
8. Hassn Mesrati, M., Syafruddin, S.E., Mohtar, M.A., Syahir, A.: CD44: a multifunctional mediator of cancer progression. Biomolecules 11(12), 1850 (2021)
9. Kalpana, G., Figy, C., Yeung, M., Yeung, K.C.: Reduced RhoA expression enhances breast cancer metastasis with a concomitant increase in CCR5 and CXCR4 chemokines signaling. Sci. Rep. 9(1), 16351 (2019)
10. Kim, J.G., et al.: Regulation of RhoA GTPase and various transcription factors in the RhoA pathway. J. Cell. Physiol. 233(9), 6381–6392 (2018)
11. Kokošar, J., Špendl, M.: biolab/discovery-science-2023: Release 1.0 (2023). https://doi.org/10.5281/zenodo.8023658
12. Li, X.: Prognostic value of CD44 expression in renal cell carcinoma: a systematic review and meta-analysis. Sci. Rep. 5(1), 13157 (2015)
13. Miyazaki, J., et al.: Progression of human renal cell carcinoma via inhibition of RhoA-ROCK axis by PARG1. Transl. Oncol. 10(2), 142–152 (2017)
14. Oughtred, R., et al.: The BioGRID database: a comprehensive biomedical resource of curated protein, genetic, and chemical interactions. Protein Sci. 30(1), 187–200 (2021)
15. Park, M., Lee, J.W., Park, T., Lee, S.: Gene-gene interaction analysis for the survival phenotype based on the Kaplan-Meier median estimate. BioMed Res. Int. 2020 (2020)
16. Rahman, M., Jackson, L.K., Johnson, W.E., Li, D.Y., Bild, A.H., Piccolo, S.R.: Alternative preprocessing of RNA-sequencing data in the cancer genome atlas leads to improved analysis results. Bioinformatics 31(22), 3666–3672 (2015)
17. Ritchie, M.D., Van Steen, K.: The search for gene-gene interactions in genome-wide association studies: challenges in abundance of methods, practical considerations, and biological interpretation. Ann. Transl. Med. 6(8), 157 (2018)
18. Roosta, Y., Sanaat, Z., Nikanfar, A.R., Dolatkhah, R., Fakhrjou, A.: Predictive value of CD44 for prognosis in patients with breast cancer. Asian Pacific J. Cancer Prev. APJCP 21(9), 2561 (2020)
19. Royston, P., Parmar, M.K.: Restricted mean survival time: an alternative to the hazard ratio for the design and analysis of randomized trials with a time-to-event outcome. BMC Med. Res. Methodol. 13(1), 1–15 (2013)
20. Si, D., Yin, F., Peng, J., Zhang, G.: High expression of CD44 predicts a poor prognosis in glioblastomas. Cancer Manage. Res. 12, 769 (2020)
21. Singh, R., Mukhopadhyay, K.: Survival analysis in clinical trials: basics and must know areas. Perspect. Clin. Res. 2(4), 145 (2011)
22. Špendl, M., Kokošar, J., Praznik, E., Ausec, L., Zupan, B.: Ranking of survival-related gene sets through integration of single-sample gene set enrichment and survival analysis. In: Juarez, J.M., Marcos, M., Stiglic, G., Tucker, A. (eds.) AIME 2023. LNCS, vol. 13897, pp. 328–337. Springer, Cham (2023). https://doi.org/10.1007/978-3-031-34344-5_39
23. Subramanian, A., et al.: A next generation connectivity map: L1000 platform and the first 1,000,000 profiles. Cell 171(6), 1437–1452 (2017)
24. Szklarczyk, D., et al.: STRING v11: protein-protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets. Nucleic Acids Res. 47(D1), D607–D613 (2019)
25. Uno, H., et al.: Moving beyond the hazard ratio in quantifying the between-group difference in survival analysis. J. Clin. Oncol. 32(22), 2380 (2014)
26. Van Steen, K.: Travelling the world of gene-gene interactions. Brief. Bioinform. 13(1), 1–19 (2012)
27. Vargas, A.J., Harris, C.C.: Biomarker development in the precision medicine era: lung cancer as a case study. Nat. Rev. Cancer 16(8), 525–537 (2016)
28. Zhang, R., et al.: Independent validation of early-stage non-small cell lung cancer prognostic scores incorporating epigenetic and transcriptional biomarkers with gene-gene interactions and main effects. Chest 158(2), 808–819 (2020)
29. Zhao, L., et al.: On the restricted mean survival time curve in survival analysis. Biometrics 72(1), 215–221 (2016)
30. Zhu, J., et al.: A two-phase comprehensive NSCLC prognostic study identifies lncRNAs with significant main effect and interaction. Mol. Genet. Genomics 297(2), 591–600 (2022)
Acknowledgements
This work was supported by the Slovenian Research Agency Program Grant P2-0209 and Project Grants L2-3170 and V2-2272.
Rights and permissions
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Copyright information
© 2023 The Author(s)
About this paper
Cite this paper
Kokošar, J., Špendl, M., Zupan, B. (2023). Gene Interactions in Survival Data Analysis: A Data-Driven Approach Using Restricted Mean Survival Time and Literature Mining. In: Bifet, A., Lorena, A.C., Ribeiro, R.P., Gama, J., Abreu, P.H. (eds) Discovery Science. DS 2023. Lecture Notes in Computer Science(), vol 14276. Springer, Cham. https://doi.org/10.1007/978-3-031-45275-8_20
DOI: https://doi.org/10.1007/978-3-031-45275-8_20
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-45274-1
Online ISBN: 978-3-031-45275-8
eBook Packages: Computer Science, Computer Science (R0)