Abstract
Ulcerative colitis (UC) is a chronic inflammatory bowel disease with intricate pathogenesis and varied presentation. Accurate diagnostic tools are imperative to detect and manage UC. This study sought to construct a robust diagnostic model using gene expression profiles and to identify key genes that differentiate UC patients from healthy controls. Gene expression profiles from eight cohorts, encompassing a total of 335 UC patients and 129 healthy controls, were analyzed. A total of 7530 gene sets were computed using the GSEA method. Subsequent batch correction, PCA plots, and intersection analysis identified crucial pathways and genes. Machine learning, incorporating 101 algorithm combinations, was employed to develop diagnostic models. Verification was done using four external cohorts, adding depth to the sample repertoire. Evaluation of immune cell infiltration was undertaken through single-sample GSEA. All statistical analyses were conducted using R (Version: 4.2.2), with significance set at a P value below 0.05. Employing the GSEA method, 7530 gene sets were computed. From this, 19 intersecting pathways were discerned to be consistently upregulated across all cohorts, which pertained to cell adhesion, development, metabolism, immune response, and protein regulation. This corresponded to 83 unique genes. Machine learning insights culminated in the LASSO regression model, which outperformed others with an average AUC of 0.942. This model's efficacy was further ratified across four external cohorts, with AUC values ranging from 0.694 to 0.873 and significant Kappa statistics indicating its predictive accuracy. The LASSO logistic regression model highlighted 13 genes, with LCN2, ASS1, and IRAK3 emerging as pivotal. Notably, LCN2 showcased significantly heightened expression in active UC patients compared to both non-active patients and healthy controls (P < 0.05). 
Investigations into the correlation between these genes and immune cell infiltration in UC highlighted activated dendritic cells, with statistically significant positive correlations noted for LCN2 and IRAK3 across multiple datasets. Through comprehensive gene expression analysis and machine learning, a potent LASSO-based diagnostic model for UC was developed. Genes such as LCN2, ASS1, and IRAK3 hold potential as both diagnostic markers and therapeutic targets, offering a promising direction for future UC research and clinical application.
Introduction
Ulcerative colitis (UC) is an inflammatory bowel disease (IBD) that predominantly affects the mucosal and submucosal layers of the colon and rectum, manifesting as a chronic condition characterized by inflammation and ulceration of the lining of the colon and rectum. Prolonged UC also causes structural damage and increases susceptibility to conditions such as colon cancer and extraintestinal malignancies1,2. However, the pathogenesis of UC is complex and not fully elucidated. It is currently understood that UC predominantly affects individuals with genetic susceptibility, while factors such as epithelial barrier defects, dysbiosis, and dysregulated immune responses play significant roles in its pathogenesis3,4,5. Epidemiologically, the incidence and prevalence of UC have risen dramatically in recent years. Globally, the highest incidence and prevalence are in Northern Europe (505 per 100,000 in Norway), followed by North America (286 per 100,000 in the USA)6. The annual incidence of UC in Europe has surged to 24.3 cases per 100,000 individuals, and there is a clear upward trajectory in both the prevalence and incidence of UC over time7. Notably, in many newly industrialized countries in South America, Asia, and Africa, the prevalence is still low, but the number of new UC diagnoses is increasing, and the prevalence is expected to rise in the future8. This presents a substantial challenge for healthcare systems on a global scale.
Immune system dysfunction is one potential mechanism underlying UC. When the immune system responds to invading viruses or bacteria, an abnormal immune response can cause it to also attack cells in the digestive tract, leading to chronic intestinal inflammation or mucosal damage. Genetics also plays a role, as UC is more common in people with family members who have the disease9. In the past, UC was commonly managed with 5-aminosalicylates, steroids, and thiopurines. However, despite these treatment options, UC continues to significantly affect patients' quality of life and is associated with a high morbidity rate10. Procedures such as ileal pouch-anal anastomosis and colectomy come with the potential risks of infertility, compromised pouch function, and the development of pouchitis11. In recent years, targeted therapeutic agents such as tumor necrosis factor (TNF) inhibitors and interleukin inhibitors have garnered increased attention in clinical practice. With ongoing advancements in drug development, there has been a substantial decrease in UC-related mortality, improving the overall prognosis for patients with UC12. Nonetheless, there is substantial room for improvement in the management of UC, as existing studies report remission rates (based on the Mayo score, which reflects clinical improvements in stool frequency, rectal bleeding, and mucosal appearance on endoscopy) typically falling below 20–30%12.
The diagnosis of UC primarily rests on a combination of clinical symptoms, endoscopic findings, histological examination, and exclusion of other causes of colitis, such as infections13,14. Serological markers and fecal calprotectin can assist in differentiating UC from other gastrointestinal disorders, but they are not definitive. Looking ahead, there is growing interest in genetics as a source of diagnostic insight. Recent advances in genome-wide association studies (GWAS) have identified numerous genetic loci associated with UC susceptibility15,16. Clinical symptoms might also correlate with genetic alterations: gene expression profiles in symptomatic controls, in whom inflammatory bowel disease (IBD) had been excluded, resembled those of IBD patients and diverged from those of healthy controls. The gene expression signatures of these IBD-excluded samples were related to their symptomatic status17. Crooke et al. measured the transcript levels of 45 genes in blood by quantitative real-time polymerase chain reaction, and then used ratio score and support vector machine methods to distinguish UC from several other gastrointestinal diseases18. In recent years, next-generation sequencing has been widely applied in disease diagnosis and precision treatment19,20. As our understanding of the genetic architecture of UC deepens, it is anticipated that genetic markers could serve as adjunct diagnostic tools, offering more precise disease categorization and personalized therapeutic strategies. This burgeoning area of research holds the promise of reshaping the diagnostic landscape of UC in the future.
The objective of the current study is to explore the potential of gene expression profiles to improve the accuracy and early detection of UC, particularly in cases where traditional diagnostic methods may be inconclusive. While traditional diagnostics are effective and cost-efficient, gene expression profiling offers distinct advantages. These include the ability to identify molecular changes at an early stage, which may precede clinical symptoms, thus enabling earlier intervention and potentially improving patient outcomes. In this study, we used mucosal tissue gene expression data from a training set of 259 UC patients and 60 individuals without the condition. From this dataset, we identified 13 characteristic genes and developed a predictive model with a high degree of accuracy for UC diagnosis.
Methods
Patients’ summary
We collected a total of eight cohorts containing both healthy controls and UC patients for the current study. The training datasets derived from mucosal tissue samples included GSE87466 with 21 normal and 87 UC patients, GSE59071 with 11 normal and 97 UC patients, GSE47908 with 15 normal and 45 UC patients, and GSE38713 with 13 normal and 30 UC patients. For validation, the mucosal tissue cohorts comprised GSE53306, which had 12 normal controls, 16 patients in the active UC category and 12 in the inactive UC category. Similarly, GSE13367 had 8 inflamed and 9 non-inflamed UC patients, compared with 10 controls. GSE48958, also from mucosal tissue, had 7 active UC and 6 inactive UC patients, along with 8 controls. Finally, the GSE126124 dataset, derived from peripheral whole blood, included 39 normal and 18 UC patients (Table 1).
Mitigating batch effects
Batch effects are non-biological discrepancies observed across multiple datasets. To ensure analytical consistency and mitigate biases introduced by such effects, we employed the ComBat algorithm from the "sva" package. This method was used to harmonize the transcriptional profiles of the training cohorts (GSE87466, GSE59071, GSE47908, GSE38713), thereby offsetting the intrinsic batch differences among them. For the validation cohorts, we did not apply this procedure, as our intent was to test the diagnostic model across diverse platforms.
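The idea of removing per-dataset location and scale differences can be illustrated with a deliberately simplified Python sketch. It is not ComBat: ComBat additionally applies empirical Bayes shrinkage to the per-batch estimates and can protect biological covariates, whereas this sketch only standardizes one gene within each batch and restores the pooled mean and standard deviation. The function name and the toy values are hypothetical.

```python
from statistics import mean, pstdev


def center_scale_by_batch(values, batches):
    """Standardize one gene within each batch, then restore the pooled mean and SD.

    A crude location-scale adjustment for illustration only; this is NOT ComBat.
    """
    pooled_mean = mean(values)
    pooled_sd = pstdev(values)
    adjusted = [0.0] * len(values)
    for batch in set(batches):
        idx = [i for i, b in enumerate(batches) if b == batch]
        batch_vals = [values[i] for i in idx]
        b_mean = mean(batch_vals)
        b_sd = pstdev(batch_vals) or 1.0  # guard against zero variance in a batch
        for i in idx:
            adjusted[i] = (values[i] - b_mean) / b_sd * pooled_sd + pooled_mean
    return adjusted


# Toy data (hypothetical): one gene measured in two batches with a large shift.
expr = [1.0, 2.0, 3.0, 11.0, 12.0, 13.0]
batch = ["A", "A", "A", "B", "B", "B"]
adjusted = center_scale_by_batch(expr, batch)
```

After adjustment both toy batches share the same mean, which mirrors the overlap of cohorts expected in a post-correction PCA plot. Because this sketch ignores disease status, it would also remove real biological differences if the case-control ratio differed strongly between batches.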
Calculation of the scores of signaling pathways
Gene Set Enrichment Analysis (GSEA) is a computational approach that determines whether a predefined gene set shows statistically significant differences between two groups. We used GSEA to compare the activated signaling pathways between UC patients and healthy controls. The reference collection of molecular signature gene sets was obtained from MSigDB (C5: Biological Process), comprising a total of 7530 gene sets21,22.
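For readers unfamiliar with the statistic, the following Python sketch shows the running-sum enrichment score in the style described by Subramanian et al.22. It is a minimal illustration under stated assumptions, not the authors' pipeline: significance testing by permutation and score normalization are omitted, and the gene names, ranking values, and function name are hypothetical.

```python
def enrichment_score(ranked_genes, scores, gene_set, p=1.0):
    """Running-sum enrichment score for one gene set.

    ranked_genes: gene names ordered from most up- to most down-regulated.
    scores: ranking metric for each gene, in the same order.
    gene_set: set of gene names to test.
    p: exponent that weights hits by the magnitude of their ranking metric.
    """
    hits = [g in gene_set for g in ranked_genes]
    n_miss = len(ranked_genes) - sum(hits)
    hit_weight_total = sum(abs(s) ** p for s, h in zip(scores, hits) if h)
    if n_miss == 0 or hit_weight_total == 0:
        raise ValueError("gene set must overlap the ranked list without covering it")
    running, best = 0.0, 0.0
    for s, h in zip(scores, hits):
        if h:
            running += abs(s) ** p / hit_weight_total  # step up at a set member
        else:
            running -= 1.0 / n_miss  # step down at a non-member
        if abs(running) > abs(best):
            best = running  # keep the maximum deviation from zero
    return best


# Toy ranked list (hypothetical genes and metric values).
genes = ["g1", "g2", "g3", "g4", "g5", "g6"]
metric = [3.0, 2.0, 1.0, -1.0, -2.0, -3.0]
es_top = enrichment_score(genes, metric, {"g1", "g2"})
es_bottom = enrichment_score(genes, metric, {"g5", "g6"})
```

A set concentrated at the top of the ranking yields a positive score and a set concentrated at the bottom yields a negative one, which is how upregulated and downregulated pathways are separated.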
An integrative diagnostic model leveraging machine learning techniques
To build a unified model with robust accuracy and stability in distinguishing between UC patients and healthy individuals, we combined 10 machine learning algorithms, yielding 101 algorithmic combinations. The algorithms comprised Elastic Net (Enet), Lasso, Ridge, Stepglm[both], Stepglm[backward], glmBoost, Linear Discriminant Analysis (LDA), NaiveBayes, plsRglm, Random Forest (RF), and Support Vector Machine (SVM). The signature derivation protocol was as follows: (1) the most prominently activated pathways in UC patients across the four GEO cohorts were isolated; (2) the 101 algorithmic combinations were executed on the genes curated from these prominently activated pathways; (3) all models were trained in the GSE87466 dataset and validated in the remaining three cohorts; (4) for every model, the AUC was calculated across all participating cohorts.
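Step (4) can be illustrated with a short Python sketch that computes the AUC as the probability that a randomly chosen UC sample receives a higher score than a randomly chosen control, then averages it over cohorts. The cohort names and scores are hypothetical, and the authors' R implementation may differ in detail.

```python
def auc(scores, labels):
    """AUC as P(score of a positive > score of a negative); ties count as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical model scores (1 = UC, 0 = healthy control) in two toy cohorts.
cohorts = {
    "cohort_1": ([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0]),
    "cohort_2": ([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0]),
}
per_cohort_auc = {name: auc(s, y) for name, (s, y) in cohorts.items()}
mean_auc = sum(per_cohort_auc.values()) / len(per_cohort_auc)
```

Ranking candidate models by the mean of their per-cohort AUC values, rather than by the training AUC alone, penalizes models that fit one cohort well but transfer poorly.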
Evaluation of immune cell infiltration
Immune cell infiltration was estimated from the transcriptional data using single-sample gene set enrichment analysis (ssGSEA). The gene sets representing 28 immune cell types were taken from the study by Charoentong et al.23.
Statistics
Statistical analyses were conducted using R (Version: 4.2.2). For continuous variables, Student's t-test was used for comparisons between two groups when the data were normally distributed, and the Wilcoxon rank-sum (Mann–Whitney) test was used otherwise. Pearson correlation analysis was used for continuous variables. Pertinent pathways were displayed as a heatmap using the R package "pheatmap". The Kappa statistic was used to measure agreement between predicted and actual classes. For comparisons across more than two groups, the Kruskal–Wallis test was used, and for pairwise assessments, the Wilcoxon test was applied24. A two-tailed P value below 0.05 was considered to indicate statistical significance.
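As a worked illustration of the Kappa statistic, the Python sketch below computes Cohen's kappa, which corrects the observed agreement between predicted and actual classes for the agreement expected by chance. That the Kappa statistic here refers to Cohen's kappa is an assumption, and the labels are toy values rather than study data.

```python
def cohen_kappa(predicted, actual):
    """Cohen's kappa: (observed agreement - chance agreement) / (1 - chance agreement)."""
    n = len(actual)
    if n == 0 or n != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    classes = set(predicted) | set(actual)
    observed = sum(p == a for p, a in zip(predicted, actual)) / n
    expected = sum(
        (predicted.count(c) / n) * (actual.count(c) / n) for c in classes
    )
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (observed - expected) / (1.0 - expected)


# Toy labels (hypothetical): 6 UC and 4 healthy controls (HC), 8 of 10 predicted correctly.
actual = ["UC"] * 6 + ["HC"] * 4
predicted = ["UC"] * 5 + ["HC"] + ["HC"] * 3 + ["UC"]
kappa = cohen_kappa(predicted, actual)
```

In this toy case the raw accuracy is 0.8 while kappa is lower, because with unbalanced classes a substantial share of agreement is expected by chance alone.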
Results
Summary of the process
In this study, transcriptomic data from four cohorts, encompassing Ulcerative Colitis (UC) patients and healthy controls, were evaluated to identify key signaling pathways associated with UC. The gene expression profiles underwent batch correction to ensure uniformity and mitigate batch effects. Using Gene Set Enrichment Analysis (GSEA), over 7500 gene sets were computed, each representing a unique cellular signaling pathway. Machine learning techniques were then employed, with the LASSO regression model emerging as the most efficient diagnostic tool with an average AUC value of 0.942. The robustness of this model was validated using external cohorts. From the diagnostic model, 13 characteristic genes were identified and assessed for their expression differences. Three of these genes, LCN2, ASS1, and IRAK3, were particularly noteworthy as they exhibited elevated expression in UC patients. The study further examined the relationship between these genes and immune cell infiltration, establishing their correlation with activated dendritic cells. These findings reinforce the role of immune system dysregulation in UC and introduce potential biomarkers for diagnostic and therapeutic applications. The flowchart of the current study is displayed in Fig. 1.
Identifying key signaling pathways reflecting UC
As delineated in the methods section, our study incorporated samples from four cohorts, encompassing both UC patients and healthy controls. To ensure uniformity of the transcriptomic data before further analysis, we initially subjected the gene expression profiles from all four cohorts to batch correction. Prior to this correction, the PCA plot exhibited pronounced disparities among the four cohorts (Fig. 2A). However, post-correction, batch effect variations in gene expression distribution across all cohorts were effectively nullified (Fig. 2B). Subsequently, employing the GSEA method, we computed 7530 gene sets, each reflecting the activation status of distinct cellular signaling pathways; each sample included in the analysis garnered a score across these 7530 pathways. The distribution of scores for these pathways across samples in the different cohorts is illustrated in Fig. 2C.
Subsequent to this, within each cohort, we discerned signaling pathways that were differentially activated between UC patients and healthy controls (Fig. 3A). In the GSE38713 cohort, 79 pathways were upregulated in UC patients; in the GSE47908 cohort, 428 pathways were upregulated; in the GSE59071 cohort, 107 pathways were upregulated, and in the GSE87466 cohort, 3,609 pathways saw upregulation in UC patients. By extracting the intersecting upregulated pathways across the four cohorts, a total of 19 pathways were finalized (Fig. 3B). These 19 pathways pertained to cell adhesion and development, cell respiration and metabolism, immune response and signaling, as well as regulation of protein activity and secretion (Fig. 3C). Excluding the redundant genes within these pathways, a total of 83 unique genes remained.
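The intersection step can be sketched in a few lines of Python: keep only the pathways upregulated in every cohort, then pool their member genes and drop duplicates. The pathway names and gene memberships below are placeholders, not the study's actual pathways.

```python
def shared_pathways(upregulated_by_cohort):
    """Return the pathways that are upregulated in every cohort."""
    sets = list(upregulated_by_cohort.values())
    return set.intersection(*sets) if sets else set()


def unique_genes(pathways, pathway_genes):
    """Pool the member genes of the given pathways, counting each gene once."""
    genes = set()
    for pathway in pathways:
        genes |= set(pathway_genes[pathway])
    return genes


# Placeholder data: pathway names and gene memberships are hypothetical.
upregulated = {
    "GSE38713": {"pathway_A", "pathway_B", "pathway_C"},
    "GSE47908": {"pathway_A", "pathway_B", "pathway_D"},
    "GSE59071": {"pathway_A", "pathway_B"},
    "GSE87466": {"pathway_A", "pathway_B", "pathway_C", "pathway_D"},
}
pathway_genes = {
    "pathway_A": ["gene1", "gene2", "gene3"],
    "pathway_B": ["gene2", "gene3", "gene4"],
    "pathway_C": ["gene5"],
    "pathway_D": ["gene6"],
}
common = shared_pathways(upregulated)
genes = unique_genes(common, pathway_genes)
```

Requiring a pathway to be upregulated in all cohorts is a conservative filter, which explains why cohorts with hundreds or thousands of upregulated pathways can still yield a short shared list.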
Machine learning constructs a model for identifying patients with UC
The predictors used as input for the ML models are the gene expression levels of the 83 identified genes. These variables are continuous, representing the expression levels of each gene. Through the iterative analysis of the selected 83 genes across 101 algorithm combinations, 40 combination models were successfully generated. These models displayed their predictive capabilities across different cohorts using AUC values, with the average AUC value across four cohorts also being computed (Fig. 4A). Ultimately, the LASSO regression model demonstrated superior diagnostic capabilities (Average AUC = 0.942). The prediction score can be calculated with the formula: Score = 0.03328012 × SYK + 0.51625614 × CALR − 0.14331840 × GATA5 + 1.29808010 × FLRT2 + 0.80143919 × IRAK3 − 0.59448664 × DUSP26 + 0.85254969 × SPINK5 + 0.25364614 × PTPN6 + 0.44029637 × LCN2 + 0.70178103 × ASS1 + 0.20803807 × BAK1 + 0.70268334 × VCP + 0.27895531 × ACTN3.
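The scoring formula is a weighted sum, which the following Python sketch makes explicit. The coefficients are copied from the formula above; everything else is an assumption made for illustration, including the function name and the all-ones input. The text reports neither an intercept nor a classification cut-off, so none is applied here, and inputs would need to be on the same expression scale as the training data.

```python
# Coefficients as reported in the scoring formula (no intercept is reported).
COEFFICIENTS = {
    "SYK": 0.03328012,
    "CALR": 0.51625614,
    "GATA5": -0.14331840,
    "FLRT2": 1.29808010,
    "IRAK3": 0.80143919,
    "DUSP26": -0.59448664,
    "SPINK5": 0.85254969,
    "PTPN6": 0.25364614,
    "LCN2": 0.44029637,
    "ASS1": 0.70178103,
    "BAK1": 0.20803807,
    "VCP": 0.70268334,
    "ACTN3": 0.27895531,
}


def lasso_score(expression):
    """Weighted sum of the 13 gene expression values for one sample."""
    missing = [g for g in COEFFICIENTS if g not in expression]
    if missing:
        raise KeyError(f"missing expression values for: {', '.join(missing)}")
    return sum(coef * expression[g] for g, coef in COEFFICIENTS.items())


# Hypothetical sample in which every gene has expression 1.0.
sample = {gene: 1.0 for gene in COEFFICIENTS}
score = lasso_score(sample)
```

Raising the expression of a positively weighted gene such as LCN2 by one unit raises the score by its coefficient, while GATA5 and DUSP26 carry negative weights and lower it.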
Based on the LASSO model, the AUC values for the GSE87466, GSE38713, GSE59071, and GSE47908 cohorts were 1, 0.903, 0.963, and 0.902, respectively. Further, the Kappa statistic was used to evaluate the agreement between predicted and actual outcomes, revealing that the novel diagnostic model exhibited robust predictive power across all four cohorts (GSE87466: Kappa = 1, P < 0.001; GSE38713: Kappa = 0.652, P < 0.001; GSE59071: Kappa = 0.544, P < 0.001; GSE47908: Kappa = 0.623, P < 0.001; Fig. 4B–E).
Verifying the efficacy of the diagnostic model in external cohorts
To further ascertain the diagnostic capabilities of the model, we included four external cohorts: GSE53306, GSE13367, GSE48958, and GSE126124. The samples from the first three cohorts were derived from intestinal mucosal tissue, while the GSE126124 cohort utilized peripheral blood samples from patients and healthy controls. Using the same methodology, we computed the predictive results of the four external cohorts across the 40 models. Ultimately, the LASSO-based diagnostic model consistently showcased commendable diagnostic prowess (Fig. 5A) with the following results: GSE53306 (AUC = 0.798, Kappa = 0.360, P = 0.024, Fig. 5B), GSE13367 (AUC = 0.782, Kappa = 0.340, P = 0.006, Fig. 5C), GSE48958 (AUC = 0.873, Kappa = 0.529, P = 0.007, Fig. 5D). For the GSE126124 cohort, although the AUC value was only 0.694, considering that these samples were derived from peripheral blood, its predictive capability near 0.7 remains a valuable asset for clinical diagnosis (Kappa = 0.272, P = 0.003, Fig. 5E).
Expression of 13 characteristic genes in UC
The LASSO logistic regression analysis incorporated 13 genes into the model, namely SYK, CALR, GATA5, FLRT2, IRAK3, DUSP26, SPINK5, PTPN6, LCN2, ASS1, BAK1, VCP, and ACTN3. To characterize these 13 genes, we first assessed their expression differences between UC patients and healthy controls in a training cohort combined from the four cohorts. Notably, 11 of these 13 genes exhibited significantly higher expression in UC patients, while DUSP26 showed lower expression and ACTN3 showed no significant difference (Fig. 6A). We selected three significantly upregulated genes in UC, namely LCN2, ASS1, and IRAK3, for further validation in external cohorts. In the GSE13367 dataset, the expression of the three genes was notably elevated in UC patients compared to healthy controls. Although these genes exhibited higher expression in inflamed UC patients, there was no statistically significant difference when compared to non-inflamed patients (Fig. 6B). In the GSE48958 dataset, the expression trends of these genes mirrored the previously described patterns, with LCN2 showing the highest expression in active UC patients (Fig. 6C). In the GSE53306 dataset, we observed that LCN2 also had the highest expression in active UC patients, with significant differences when compared both to non-active patients (P < 0.05) and to healthy controls (P < 0.05) (Fig. 6D). These findings indicate that LCN2, ASS1, and IRAK3 are crucial markers distinguishing between healthy controls and UC patients.
Correlation between key biomarkers and immune cell infiltration
Many studies agree that immune system dysregulation is a critical factor in the onset of UC. Consequently, all included normal controls and UC patients were compared to identify differences in immune cell distribution. The majority of immune cell types showed markedly higher infiltration in UC patients, most notably myeloid-derived suppressor cells (MDSCs), neutrophils, and central memory CD4 T cells (Fig. 7A). We then evaluated the relationship between LCN2, ASS1, and IRAK3 and immune cell infiltration in all UC patients. All three genes exhibited positive correlations with the majority of immune cells, with the strongest associations found with activated dendritic cells, neutrophils, and immature dendritic cells (Fig. 7B–D). Additionally, correlations were found between LCN2 and effector memory CD8 T cells as well as gamma delta T cells (Fig. 7B); between ASS1 and type 17 T helper cells (Fig. 7C); and between IRAK3 and type 1 T helper cells and gamma delta T cells (Fig. 7D).
It was observed that all three genes had a pronounced positive correlation with activated dendritic cells. Therefore, further analysis delved into the relationship between these genes and different UC disease statuses. In the GSE13367 cohort, the strongest correlations in active UC patients with activated dendritic cells were noted (LCN2: R = 0.72, P = 0.0024; ASS1: R = 0.61, P = 0.014; IRAK3: R = 0.71, P = 0.0029; Fig. 8A). In the GSE48958 cohort, only IRAK3 exhibited a positive correlation with activated dendritic cells in active UC patients (R = 0.82, P = 0.034, Fig. 8B).
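The R values reported here are correlation coefficients between a gene's expression and an immune cell's ssGSEA score across samples. As a minimal sketch, assuming the Pearson correlation named in the Statistics section, the coefficient can be computed as follows; the expression and infiltration values are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("inputs must have equal length of at least 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / sqrt(sxx * syy)


# Hypothetical values: gene expression and dendritic cell score in four samples.
gene_expression = [1.0, 2.0, 3.0, 4.0]
dc_score = [2.0, 4.0, 6.0, 8.0]
r_positive = pearson_r(gene_expression, dc_score)
r_negative = pearson_r(gene_expression, dc_score[::-1])
```

With the small subgroup sizes in the validation cohorts, a single outlying sample can shift the coefficient substantially, so the accompanying P values deserve as much attention as the R values.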
Discussion
UC remains a focal point in gastroenterological research due to its multifaceted etiological profile and the intricacies associated with its management25,26. Developing a robust diagnostic model that can accurately differentiate between UC patients and healthy individuals could offer a paradigm shift in the management of this condition. The application of machine learning in biomedical research has surged exponentially in recent years, with its prowess in data handling and pattern recognition being especially transformative for complex datasets27,28,29. The present study exemplifies this paradigm shift by utilizing machine learning to sift through intricate gene expression profiles, leading to the elucidation of a diagnostic model for UC.
In the current study, four training cohorts were used to identify key pathways and genes, leading to the construction of the prediction model in GSE87466, followed by internal validation and subsequent external validation. GSE87466, comprising the largest sample size, was selected for model construction. We did not merge all four training cohorts into a single large dataset for model construction because of the potentially substantial batch effects among the cohorts. For the external validation cohort, GSE126124 comprises samples from peripheral whole blood, whereas the training cohort GSE87466 includes samples from mucosa. In summary, this study encompasses training, internal validation, external validation, and further validation with peripheral whole blood samples to ensure the diagnostic model's robustness and credibility. Central to our findings is the delineation of specific cellular pathways and genes that are distinctly altered in UC patients. Notably, the pathways identified in our study encompass a broad spectrum of cellular processes, ranging from cell adhesion to immune signaling, reinforcing the notion of UC as a systemic ailment with widespread cellular repercussions30,31. In the subsequent analysis, the 83 genes were evaluated iteratively across 101 algorithm combinations. Out of these combinations, a set of 40 viable diagnostic models emerged, showcasing the flexibility and rigor of machine learning in generating a suite of models tailored to the data's nuances. The average AUC value of 0.942 achieved by the LASSO model, and its robust predictive power across all four cohorts, underscore its efficacy. In addition, the model demonstrated remarkable diagnostic precision across multiple external validation cohorts. The strength of the model, as evidenced by its high average AUC value, suggests that gene expression profiling can serve as a formidable tool in the diagnostic arsenal against UC. Furthermore, the robustness of this model, even when applied to peripheral blood samples, underscores its potential versatility and broad applicability in clinical settings.
The incorporation of machine learning also allowed for the identification of 13 key genes, among which further validation revealed LCN2, ASS1, and IRAK3 as pivotal markers distinguishing between healthy individuals and UC patients. It is well established that UC is characterized by chronic inflammation of the colon, predominantly driven by an aberrant immune response32. In this study, the robust correlation observed between the expression levels of LCN2, ASS1, and IRAK3 and specific immune cell populations, particularly activated dendritic cells, highlights the intertwined relationship between these genes and immune cell activity in UC33. Dendritic cells are known to play a pivotal role in antigen presentation and initiation of adaptive immune responses; their activation could subsequently lead to the recruitment and activation of other immune cells, perpetuating the inflammatory cascade observed in UC34,35. Notably, LCN2 has been previously documented to play a role in innate immunity, being associated with neutrophil function and acting as a bacteriostatic agent by sequestering iron, which in turn limits bacterial growth36,37,38. Although IRAK3 correlated with the infiltration of activated dendritic cells, it could not distinguish the disease status of UC. A potential reason is that, in UC, uninflamed (inactive) regions undergo inflammation and tissue remodeling similar to that in inflamed (active) regions, and both show increased expression of TGF-β, vimentin, and α-SMA39.
Combining various methods in a multi-faceted research setup presents a range of benefits and drawbacks. One significant advantage is the increased robustness and reliability of the results. By integrating different techniques, such as machine learning algorithms and gene expression analyses, researchers can cross-validate findings, reducing the likelihood of false positives and enhancing the overall confidence in the results. Additionally, the flexibility in combining methods can facilitate the discovery of novel biomarkers and therapeutic targets, providing a holistic view of disease mechanisms and potential intervention points. However, there are inherent drawbacks to this approach. The complexity of managing and integrating diverse datasets and methodologies can be challenging, requiring advanced computational skills and substantial computational resources. The risk of overfitting increases with the use of multiple machine learning models, where a model may perform exceptionally well on training data but poorly on unseen data, thus limiting its generalizability. Furthermore, while combining methods can highlight potential biomarkers or pathways, it often does not provide mechanistic insights into their roles, necessitating further functional studies to elucidate their contributions to disease pathogenesis. Therefore, while the integration of multiple methods can significantly advance our understanding and management of diseases like UC, it requires careful consideration of these potential limitations.
While the advantages of machine learning are manifold, it is vital to approach its results with a measure of caution, and there are several limitations for the current study. First, this study utilized a relatively small cohort of patients. Larger and more varied cohorts are necessary to validate the diagnostic model across different demographic groups. Second, the external validation cohorts primarily consisted of mucosal tissue samples, with only one cohort (GSE126124) derived from peripheral blood. The diagnostic model's performance in blood samples was lower (AUC = 0.694) compared to mucosal samples, indicating the need for further refinement and validation in non-invasive sample types like blood. Third, while the study identified several key genes and pathways associated with UC, it did not provide detailed mechanistic insights into how these genes contribute to the disease's pathogenesis. Functional studies are necessary to elucidate the biological roles of these genes and their potential as therapeutic targets.
Conclusion
In conclusion, our research epitomizes the transformative potential of machine learning in the realm of UC research, offering hope for more accurate and early diagnosis. As we stand on the cusp of a new era in personalized medicine, integrating machine learning insights with traditional biomedical research could pave the way for novel therapeutic avenues and improved patient outcomes. Future studies should prioritize external validation of these models in diverse populations and delve deeper into the functional roles of identified biomarkers.
Data availability
All the datasets presented in this study can be obtained from the GEO (http://www.ncbi.nlm.nih.gov/geo) database, and details are listed in Table 1. Data are provided within the manuscript or supplementary information files and are available upon request from the corresponding author.
References
Jess, T., Rungoe, C. & Peyrin-Biroulet, L. Risk of colorectal cancer in patients with ulcerative colitis: A meta-analysis of population-based cohort studies. Clin. Gastroenterol. Hepatol. 10, 639–645. https://doi.org/10.1016/j.cgh.2012.01.010 (2012).
Lo, B., Zhao, M., Vind, I. & Burisch, J. The risk of extraintestinal cancer in inflammatory bowel disease: A systematic review and meta-analysis of population-based cohort studies. Clin. Gastroenterol. Hepatol. 19, 1117–1138. https://doi.org/10.1016/j.cgh.2020.08.015 (2021).
Dinallo, V. et al. Neutrophil extracellular traps sustain inflammatory signals in ulcerative colitis. J. Crohns Colitis 13, 772–784. https://doi.org/10.1093/ecco-jcc/jjy215 (2019).
Eckburg, P. B. et al. Diversity of the human intestinal microbial flora. Science 308, 1635–1638. https://doi.org/10.1126/science.1110591 (2005).
Neurath, M. F. Cytokines in inflammatory bowel disease. Nat. Rev. Immunol. 14, 329–342. https://doi.org/10.1038/nri3661 (2014).
Ng, S. C. et al. Worldwide incidence and prevalence of inflammatory bowel disease in the 21st century: A systematic review of population-based studies. Lancet 390, 2769–2778. https://doi.org/10.1016/S0140-6736(17)32448-0 (2017).
Molodecky, N. A. et al. Increasing incidence and prevalence of the inflammatory bowel diseases with time, based on systematic review. Gastroenterology 142, 46–54. https://doi.org/10.1053/j.gastro.2011.10.001 (2012).
Acknowledgements
The authors acknowledge support from the 2023 Wan-nan Medical College Scientific Research Project (No. WK2023ZQNZ52), the Key Research Project of Wan-nan Medical College (No. WK2022ZF03), and the Wuhu City Science and Technology Project (No. 2021cg36). The authors thank ChatGPT for polishing the language and grammar of the article.
Author information
Contributions
Jing Wang: Conceptualization, methodology, formal analysis, writing—original draft preparation; Lin Li: Data curation, software, validation, visualization; Pingbo Chen: Supervision, funding acquisition, project administration, writing—reviewing and editing; Chiyi He: Supervision, writing—reviewing and editing; Xiaoping Niu: Supervision, funding acquisition, writing—reviewing and editing. All authors read and approved the final manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Wang, J., Li, L., Chen, P. et al. Exploration and verification a 13-gene diagnostic framework for ulcerative colitis across multiple platforms via machine learning algorithms. Sci Rep 14, 15009 (2024). https://doi.org/10.1038/s41598-024-65481-8