Introduction

Acute Myeloid Leukemia (AML), the most common type of leukemia in adults, is a malignant clonal disease that originates from hematopoietic stem cells with clonal evolution and genetic heterogeneity1. Clonal expansion and differentiation arrest of bone marrow hematopoietic stem cells are the two main characteristic features of AML, with the clinical symptoms of the disease including fever, infection, bleeding, and anemia, among others. According to research, AML incidence often increases with age, and most AML patients have a poor overall prognosis and high risk of relapse2. Statistically, AML has an incidence rate of approximately 4.2/100,000, with over 20,000 cases reported yearly in the USA3. The 3 + 7 regimen (3 days of anthracycline + 7 days of cytarabine) induction chemotherapy has long been the gold standard of AML treatment4. Furthermore, consolidation therapies such as the 3 + 7 sequential allogeneic hematopoietic stem cell transplantation (Allo-HSCT), have been used for a considerable period as a standard treatment option, particularly for AML patients who can withstand a strong chemotherapy regimen5. Notably, the introduction of many novel targeted drugs into the clinical treatment of AML has changed the disease’s overall treatment landscape. Although clinical treatment refinement has significantly improved the Survival Rate (SR) among AML patients, the mortality rate associated with the disease remains high, with chemotherapy resistance and relapse as the most significant challenges. Two long-term follow-up studies involving elderly patients revealed that the five-year Overall Survival (OS) of adult AML (non-acute promyelocytic leukemia) patients after relapse was only 10%, while the 3- to 5-year OS after transplantation was estimated to be 15–25%6. In this regard, actively exploring the molecular markers associated with AML pathogenesis is imperative as it will guide AML-targeted drug development and clinical treatment.

For a range of therapeutic targets, both monoclonal antibodies and specific targeted molecular drugs should be examined at the single-cell level. Advances in molecular analysis7 have contributed significantly to the treatment of malignant tumors in the hemopoietic system, especially acute leukemia. Single-cell sequencing, a technique that entails a high-throughput sequencing of genomes, transcriptomes, and epigenomes at the single-cell level8,9, can reveal the gene structure and gene expression status of a single cell, reflecting intercellular heterogeneity. Through a combination of high-throughput sequencing and bioinformatics tools, single-cell sequencing enables the analysis of the gene expression of multiple cell types in tissues, aiding the identification of rare and highly heterogeneous cell populations10. Previous research11,12,13,14,15,16 has unraveled deeper features of the bone marrow environment, elucidated the cellular hierarchy and transcriptional heterogeneity in AML, and enhanced the understanding of disease progression and interactions between leukemia cells and immunocytes.

The Tumor Microenvironment (TME), the internal environment in which tumor cells live and develop, is critically involved in tumor occurrence, progression, and immune escape17. The TME has two main components: (1) the immune microenvironment, which encompasses Myeloid-Derived Suppressor Cells (MDSCs), Tumor-Associated Macrophages (TAMs), Tumor-Associated Neutrophils (TANs), Dendritic Cells (DCs), T cells, B cells, and Natural Killer (NK) cells; and (2) the non-immune microenvironment, which encompasses Cancer-Associated Fibroblasts (CAFs), the Extracellular Matrix (ECM), Mesenchymal Stem Cells (MSCs), and various secretory factors18,19,20,21. In this regard, a tumor is a complex hybrid entity comprising malignant cells, immunocytes, and stromal cells with intra- and inter-tumor heterogeneity. Notably, Intra-Tumor Heterogeneity (ITH) among cell subgroups is one of the major causes of drug resistance and relapse in AML22,23. By revealing the heterogeneity of tumor cell subsets and facilitating the comprehensive analysis of the highly complex TME, single-cell sequencing has enabled researchers to explore the correlation between TME subgroups and tumor prognosis, distinguish the existence of drug-resistant subgroups of tumor cells, and unravel the complexity of the AML cellular heterogeneity24. However, the identification of key genes involved in the pathogenesis of AML remains crucial for the disease’s diagnosis and prognosis. Herein, we analyzed samples from eight AML patients at the single-cell level, then constructed a prognostic risk-scoring model and verified its reliability using independent external datasets and in vitro experiments.

Materials and methods

Data download and filtering

Single-cell data (GSE154109) were obtained from a public single-cell sequencing repository. Public gene expression data with clinical and survival information for AML cases were downloaded from the TCGA (https://portal.gdc.cancer.gov) and GEO (https://www.ncbi.nlm.nih.gov) databases. Data on transcriptional regulators were downloaded from the TRRUST database, and data on ligands and receptors were obtained from a Cell–Cell Interaction (CCI) database (https://baderlab.org).

The raw AML single-cell sequencing reads were processed using Cell Ranger (v2.2.0), the 10x Genomics companion pipeline, and mapped to the human genome (hg19). Using the Seurat R package, the count matrices of each library were tagged with library batch IDs and combined across independent experiments. To assess data quality, cells were screened in Seurat on three criteria: (1) the number of genes detected per cell (nGene); (2) the number of UMIs per cell (nUMI); and (3) the mitochondrial gene expression percentage. The thresholds were: (1) nGene between 2000 and 9000; (2) nUMI < 40,000; and (3) a mitochondrial gene expression percentage < 0.5%. The data were normalized as outlined in the Seurat R package manual (https://satijalab.org/seurat/h). The JackStraw and/or RunCCA functions in the Seurat R package were used to correct batch effects, and platform differences were corrected using the RunCCA function.
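The three screening thresholds above can be sketched as a simple per-cell filter. This is a minimal Python illustration and not the Seurat workflow the authors ran; the `Cell` record, its field names, and the helper functions are hypothetical, the nGene range is assumed to be inclusive, and the mitochondrial cutoff is taken literally as 0.5% from the text.

```python
from dataclasses import dataclass

# Thresholds as stated in the text; constant names are illustrative.
MIN_GENES, MAX_GENES = 2000, 9000   # assumed inclusive on both ends
MAX_UMIS = 40_000
MAX_MITO_PERCENT = 0.5


@dataclass
class Cell:
    barcode: str
    n_genes: int          # genes detected in the cell (nGene)
    n_umis: int           # UMIs counted in the cell (nUMI)
    mito_percent: float   # percentage of counts from mitochondrial genes


def passes_qc(cell):
    """Return True if the cell satisfies all three screening criteria."""
    return (
        MIN_GENES <= cell.n_genes <= MAX_GENES
        and cell.n_umis < MAX_UMIS
        and cell.mito_percent < MAX_MITO_PERCENT
    )


def filter_cells(cells):
    """Keep only the cells that pass quality control."""
    return [c for c in cells if passes_qc(c)]


# Toy cells (made-up values) exercising each criterion.
example = [
    Cell("AAAC", n_genes=3500, n_umis=12_000, mito_percent=0.2),  # passes
    Cell("AAAG", n_genes=1500, n_umis=5_000, mito_percent=0.1),   # too few genes
    Cell("AAAT", n_genes=4000, n_umis=45_000, mito_percent=0.3),  # too many UMIs
    Cell("AACC", n_genes=4000, n_umis=20_000, mito_percent=2.0),  # high mito
]
kept = filter_cells(example)
```

Applying all three criteria jointly means a cell is dropped as soon as it violates any one of them, which is why only the first toy cell survives.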

Cell recognition and clustering analysis

The Seurat R package was used to identify cells labelled with different cell categories. The number of highly dispersed genes and the parameters for determining the K nearest neighbors were adjusted to detect identifiable cell clusters. Subsequently, immunocyte clusters were identified using igraph-based community detection methods such as infomap, walktrap, and multilevel. Based on known immune-associated cellular biomarkers, clustering analysis was performed using tSNE, and the results were then integrated into the Seurat R package for cluster annotation.

The FindMarkers function (Wilcoxon rank sum test) was employed to identify differentially expressed genes between immunocyte clusters (a gene had to be expressed in at least 25% of cells to be tested). Cell cycle phases were inferred using the cell cycle scoring function, which revealed the relative expression of a substantial proportion of G2-M- and S-phase genes. The Pearson Correlation Coefficient (PCC) between two gene clusters was determined using the CellPlot function. Functions of the top Differentially Expressed Genes (DEGs) in each cluster were then inferred by GO analysis. The Benjamini & Hochberg (BH) method was used to correct for multiple hypothesis testing. Enrichment was considered statistically significant if the false discovery rate after multiple-testing correction was < 0.05.
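The BH correction mentioned above can be illustrated with a short, self-contained Python sketch. The function name and the example p-values are made up for illustration; the procedure itself is the standard step-up adjustment (each sorted p-value is multiplied by m/rank, then monotonicity is enforced from the largest p-value downward).

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values never increase
    # as the raw p-value decreases.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Illustrative p-values only; not taken from the study.
p_values = [0.001, 0.008, 0.039, 0.041, 0.6]
adjusted = benjamini_hochberg(p_values)
significant = [q < 0.05 for q in adjusted]
```

In this toy example the third and fourth terms have raw p-values below 0.05 but no longer pass after adjustment, which is the behavior the 0.05 cutoff on the corrected value is meant to produce.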

Establishment of the ligand–receptor network and the multi-factor network

Ligand–receptor and multi-factor networks were constructed using the ligand–receptor pairs available in the CCI database and the TF-target gene pairs available in the TRRUST database, respectively.
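One plausible way to assemble such a network is to keep only the database pairs whose two members both appear in the gene set of interest and then store them as an adjacency map. The Python sketch below assumes this filtering rule, which the text does not spell out; the function names and the `LIG_X`/`REC_Y` placeholders are hypothetical, while the CCL5 pairs mirror interactions listed in the Results.

```python
from collections import defaultdict


def build_network(pairs, genes_of_interest):
    """Build an undirected adjacency map from pairs whose members are both of interest."""
    network = defaultdict(set)
    for a, b in pairs:
        if a in genes_of_interest and b in genes_of_interest:
            network[a].add(b)
            network[b].add(a)
    return dict(network)


def hub_genes(network, min_degree):
    """Return genes with at least `min_degree` neighbours, sorted by name."""
    return sorted(g for g, neighbours in network.items() if len(neighbours) >= min_degree)


# Toy input: three CCL5 pairs plus one placeholder pair that gets filtered out
# because REC_Y is not in the gene set.
ligand_receptor_pairs = [
    ("CCL5", "CCR1"),
    ("CCL5", "CCR3"),
    ("CCL5", "CCR4"),
    ("LIG_X", "REC_Y"),
]
degs = {"CCL5", "CCR1", "CCR3", "CCR4", "LIG_X"}

lr_network = build_network(ligand_receptor_pairs, degs)
hubs = hub_genes(lr_network, min_degree=2)
```

The same helper could be applied to TF-target pairs to obtain the multi-factor network, although a directed representation may be preferable there since regulation runs from TF to target.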

Enrichment analysis

The obtained feature genes were annotated and subjected to enrichment analysis using the clusterProfiler R package. Enrichment analyses included GO and KEGG pathways, and GO annotation was performed on Biological Processes (BPs), Molecular Functions (MFs), and Cellular Components (CCs).
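Over-representation analysis of the kind clusterProfiler performs rests on a hypergeometric test: given a background of N genes of which K belong to a term, it asks how likely it is to see at least k term genes in a query list of n genes. The following Python sketch shows that calculation only; the function name and the toy numbers are illustrative and are not values from this study.

```python
from math import comb


def enrichment_p_value(background_size, term_size, query_size, overlap):
    """Upper-tail hypergeometric probability P(X >= overlap).

    background_size: N, all annotated genes
    term_size:       K, genes annotated to the GO/KEGG term
    query_size:      n, genes in the query list (e.g. hub genes)
    overlap:         k, query genes that are annotated to the term
    """
    total = comb(background_size, query_size)
    upper = min(term_size, query_size)
    favourable = sum(
        comb(term_size, i) * comb(background_size - term_size, query_size - i)
        for i in range(overlap, upper + 1)
    )
    return favourable / total


# Toy numbers: 20 background genes, 5 in the term, 4 in the query, 2 overlapping.
p = enrichment_p_value(background_size=20, term_size=5, query_size=4, overlap=2)
```

The resulting p-values, one per term, would then be adjusted for multiple testing before applying a significance cutoff.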

Univariate Cox regression analysis

Prognostic feature genes were screened for in immune-infiltrating cells through the Univariate Cox regression analysis of hub genes. The filter condition for screening for prognosis-related genes was p < 0.05.

Construction and validation of a prognosis risk model

Univariate Cox regression analysis was used to identify differentially expressed prognosis-related markers. Subsequently, redundant prognosis factors were screened out using LASSO to obtain prognosis-related genes. The regression coefficients of these genes were then used to construct a risk prognosis model. Based on the risk scores yielded by the model for each patient, the median was used as the cut-off value, and the samples were divided into two groups: high-risk and low-risk. The proposed model’s predictive power for 1-, 3- and 5-year survival was assessed using time-dependent ROC curves, and the survival curves of the high- and low-risk groups were also analyzed. The prognosis risk model was validated using external datasets (TCGA and GEO datasets).
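A risk score of this kind is commonly computed as a weighted sum of gene expression values, with the model coefficients as weights, followed by a median split. The Python sketch below assumes that form; the coefficients and expression values are invented for illustration and are not the fitted values from this study, and assigning scores exactly equal to the median to the low-risk group is an arbitrary choice.

```python
from statistics import median


def risk_score(expression, coefficients):
    """Weighted sum of expression values over the model genes."""
    return sum(coef * expression[gene] for gene, coef in coefficients.items())


def stratify(scores):
    """Split patients into high- and low-risk groups at the median score."""
    cutoff = median(scores.values())
    return {pid: ("high" if s > cutoff else "low") for pid, s in scores.items()}


# Hypothetical coefficients (positive = risk gene, negative = protective gene).
coefficients = {"CCL5": 0.3, "ETS2": 0.5, "MYB": -0.2}

# Hypothetical normalized expression values for four patients.
patients = {
    "P1": {"CCL5": 2.0, "ETS2": 1.0, "MYB": 3.0},
    "P2": {"CCL5": 1.0, "ETS2": 4.0, "MYB": 1.0},
    "P3": {"CCL5": 0.0, "ETS2": 1.0, "MYB": 2.0},
    "P4": {"CCL5": 3.0, "ETS2": 2.0, "MYB": 0.0},
}

scores = {pid: risk_score(expr, coefficients) for pid, expr in patients.items()}
groups = stratify(scores)
```

The two resulting groups are what the survival curves and time-dependent ROC analysis would then compare.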

AML sample collection and qRT-PCR analysis

Primary AML patients at the First People’s Hospital of Hangzhou donated the bone marrow or peripheral blood samples used herein. The patients also gave informed consent for their clinical information to be used in this study. The Ethics Committee of the First People’s Hospital of Hangzhou approved the study protocol. Total RNA was extracted using chloroform, and cDNA was reverse transcribed using BeyoRT III cDNA First Strand Synthesis Premix (5×). We performed qRT-PCR using BeyoFast SYBR Green qPCR Mix (2×, Low ROX) per the manufacturer’s instructions, with β-actin as the internal reference gene. The β-actin primers were forward 5′-CCTTCCTGGGCATGGAGTC-3′ and reverse 5′-TGATCTTCATTGTGCTGGGTG-3′; the ETS2 primers were forward 5′-AGAGACTGACGAGTGCGGTG-3′ and reverse 5′-CGAAATCATTCATCCTGCCGCT-3′. Unless otherwise specified, all analysis results with p ≤ 0.05 were considered statistically significant.
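The text does not state how relative expression was derived from the qRT-PCR readout. A common choice with an internal reference gene is the 2^(-ΔΔCt) method, shown here purely as an assumption; the function name and the Ct values are hypothetical.

```python
def relative_expression(ct_target, ct_reference, ct_target_control, ct_reference_control):
    """Fold change of a target gene versus a control sample (2^-ddCt method).

    Assumes roughly 100% amplification efficiency for both primer pairs.
    """
    # Normalize the target gene to the reference gene within each sample.
    delta_ct_sample = ct_target - ct_reference
    delta_ct_control = ct_target_control - ct_reference_control
    # Compare the normalized sample against the normalized control.
    delta_delta_ct = delta_ct_sample - delta_ct_control
    return 2.0 ** (-delta_delta_ct)


# Made-up Ct values: the target amplifies two cycles earlier in the patient
# sample than in the control, with identical reference-gene Ct values.
fold_change = relative_expression(
    ct_target=24.0,
    ct_reference=18.0,
    ct_target_control=26.0,
    ct_reference_control=18.0,
)
```

A two-cycle difference after normalization corresponds to a fourfold difference in relative expression under this method.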

Gene expression and survival analysis of AML patients

Expression of the ETS2, CCL5, and IL2RA genes was assessed by qRT-PCR in fresh or frozen samples from eight resistant and eight sensitive AML patients, with four healthy blood donors as controls. The AML patients were then divided into high- and low-expression groups according to the expression levels of ETS2, CCL5, and IL2RA. Survival data were collected, Kaplan–Meier (K–M) survival analysis was performed, and survival curves were drawn. SPSS 25.0 was used for statistical analysis, and GraphPad Prism 9 was used for plotting.
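The K–M estimate behind such survival curves can be sketched in a few lines of Python. This is a minimal product-limit estimator for illustration, not the SPSS procedure used in the study; the function name and the follow-up times are made up.

```python
def kaplan_meier(times, events):
    """Return [(time, survival_probability)] at each distinct event time.

    events[i] is 1 if the event (death) was observed at times[i]
    and 0 if the patient was censored at that time.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all patients who share the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            # Product-limit step: multiply by the fraction surviving time t.
            survival *= 1.0 - deaths / n_at_risk
            curve.append((t, survival))
        # Both deaths and censored patients leave the risk set after time t.
        n_at_risk -= removed
    return curve


# Toy follow-up data (arbitrary time units); 0 marks a censored patient.
times = [5, 8, 8, 12, 20]
events = [1, 1, 0, 1, 0]
curve = kaplan_meier(times, events)
```

Running the estimator separately on the high- and low-expression groups gives the two curves that are then compared, typically with a log-rank test.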

Ethical approval

This study was approved by the Ethics Committee of the First People’s Hospital of Hangzhou, and the methods were carried out in accordance with the approved guidelines. All patients were informed about the study and signed informed consent forms before the experiments.

Results

Single-cell data pre-processing and cell clustering analysis

Herein, 12 marrow immunocyte specimens (from eight AML patients and four healthy individuals) were retrieved from public databases and analyzed. Data pre-processing yielded 18,251 single cells. Some cells may have been of poor quality for technical reasons or because of batch effects; these were assessed and filtered according to the number of genes or UMIs per barcode. Cell clustering and UMAP dimensionality reduction were then performed on the Principal Component Analysis (PCA) results. Following that, cell types were further annotated using known cell-type-specific gene markers. Thirty-four clusters and eight cell subsets were identified and annotated (Fig. 1A–C), revealing four classes of T cells [C0 (CD8+ T), C5, C9, and C32], three classes of B cells (C3, C22, and C33), six classes of monocytes (C1, C11, C13, C18, C19, and C30), 11 classes of progenitor cells (C2, C8, C12, C14, C16, C21, C24, C27–29, and C31), and seven classes of DCs (C4, C6, C10, C15, C17, C23, and C26), as well as one type of basophil cells (C20) and two types of NK cells (C7 and C25). Notably, T cells accounted for a higher proportion of the clusters. Additionally, the proportions of the 34 clusters were examined in each sample, revealing considerable variation in composition across the 12 samples.

Figure 1

(A–C) Cell type identification; (A,B) Heterogeneity of tumor cells (T cells and NK cells). (C) Heterogeneity of tumor (AML) and normal cells. (D) Identification of DEMs; specifically expressed myeloid cell markers with different expression patterns among the 34 clusters as defined by single-cell sequencing, including CSF3R, S100A9, CD14, CLEC7A, CD3E, MCL1, S100A8, CD79A, CD68 and FCER1A. (E) Identification of DEGs; the number of specifically expressed genes with different expression patterns among the 34 clusters as defined by single-cell sequencing. Red and blue-violet indicate high and low expression, respectively. The colored squares on the top are clusters 0–33 from left to right, and the names of DEGs are written on the left margin. (F) Functional status analysis of immunocytes with specific expression markers in seven AML subgroups as defined by single-cell sequencing. (G) Identification of immunocyte marker gene expression patterns; the number of immunocyte marker genes with different expression patterns among the 34 clusters as defined by single-cell sequencing. Red and blue-violet indicate high and low expression, respectively. The colored squares on the top are clusters 0–33 from left to right, and the names of DEGs are written on the left margin.

Immunocyte functional status and marker gene expression analysis

The obtained DEGs and Differentially Expressed Markers (DEMs) were subjected to functional analysis, including all immunocytes (Fig. S1). Cluster classification, which can fully demonstrate the heterogeneity of each cell type, was then performed. Significant gene expression differences were observed among different cell subsets. The heatmap revealed the top marker genes of each cell type, such as T cell subsets, which mostly expressed TRAC, CD3D, and CD3E (Fig. 1E). Among the 34 clusters, CSF3R, S100A9, CD14, CLEC7A, CD3E, MCL1, S100A8, CD79A, CD68, and FCER1A were the specifically expressed markers, with CSF3R and MCL1 being expressed in almost every cluster. Furthermore, a significant variation was observed in the expression of different marker genes among clusters (Fig. 1D). The functional status analysis of T cells and enrichment analysis of the HALLMARK gene set revealed that they were primarily involved in protein secretion, cholesterol homeostasis, and E2F targets, among other pathways (Fig. 1F), all of which were closely related to tumor cell growth and development. Additionally, immunocyte marker genes were analyzed accordingly, revealing significant differences in marker gene expression among different immunocytes, with T cells highly expressing CD3E, IL7R, and CCR7, of which CD3E was particularly important for T cell activation and development (Fig. 1G).

Establishment of the ligand–receptor and multi-factor networks

A Protein–Protein Interaction (PPI) network was constructed using the CCI database to demonstrate interactions between DEG-encoded proteins. Herein, intercellular interactions were crucial for understanding tissue construction. As shown in the ligand–receptor network interaction maps, multiple cells expressed many intricate ligand–receptor pairs (Fig. S2). Furthermore, some ligand–receptor interactions were highly pronounced, such as the interaction of CCL5 with EDN1, INS, C5AR1, IAPP, ACKR1, CXCL2, SOD1, CXCR3, PF4, BDNF, IFNG, CSF1, AGT, CCR3, AHR, VCAN, AOC3, CXCL8, CCR7, CCL2, CCR1, TNF, ACKR4, IL8, IL1B, ALB, TGFB1, ACKR2, APP, and CCR4 (Fig. 2A). Subsequently, using 9396 pairs of human TF-mRNA regulatory relationships included in the TRRUST v2 database as the interaction background, we constructed an immunocyte multi-factor network (Fig. S3) and searched for Transcription Factors (TFs) associated with the ligand–receptor network (Fig. 2B).

Figure 2

(A) Ligand–receptor interaction network; red, green, and blue represent the ligand, labeled ligand, and ligand–receptor, respectively. (B) Multi-factor network diagram; per the colors in the caption, the ligand, receptor, and TF are shown in order in the module, and the remaining genes are TF target genes. (C) Enrichment analysis results of the top 10 hub genes, including GO BPs, MFs, CCs and KEGG pathway.

Functional and pathway analysis

Annotation and enrichment analysis of the 333 network hub genes using the clusterProfiler R package revealed the functional and pathway information of the top ten genes (Fig. 2C). Specifically, GO analysis revealed that, for BPs, the genes were mainly enriched in the response to toxic substances and the response to oxidative stress. Regarding MFs, the genes were primarily involved in TF binding and RNA polymerase II TF binding. Regarding CCs, the genes were mainly associated with the vesicle lumen and the TF complex. Finally, the KEGG enrichment analysis revealed that the selected hub genes were mainly involved in transcriptional misregulation in cancer, the TNF signaling pathway, and other related signaling pathways.

Construction and validation of a prognostic risk-scoring model

Seventy-four prognostic feature genes of immune-infiltrating cells were obtained after screening the AML hub genes through univariate Cox regression analysis. The top three prognosis-related genes (CCL5, ETS2, and IL2RA) were subjected to K–M survival analysis, which showed that survival worsened as their expression increased (Fig. 3A–C). We examined the prognostic value of the 74 key genes in AML patients and found that CCL5 and ETS2 had a hazard ratio (HR) > 1, implying that they may be risk factors for AML (Fig. 3D). The redundant prognosis factors were then screened out by LASSO, yielding ten prognosis-related genes along with their corresponding regression coefficients. The genes and their coefficients were then used to construct the prognostic risk-scoring model. Subsequently, the samples were divided into high-risk and low-risk groups based on the risk score yielded by the model for each patient, using the median as the cut-off value (Fig. 3E,F). The predictive efficacy of the prognosis model was assessed using TCGA and GEO (GSE71014) data. Time-dependent ROC curves were used to evaluate the model’s predictive ability for 1-, 3- and 5-year survival, and the survival curves of the high- and low-risk groups were compared. In the GEO dataset GSE71014 (Fig. 4A), the risk score was significantly correlated with survival status (p = 0.00044), with an average area under the ROC curve (AUC) > 0.6 across the 1-, 3- and 5-year ROC curves. In the GSE106291 dataset, the prognosis of patients in the high-risk group was significantly worse than that of those in the low-risk group (p = 0.017) (Fig. 4B). In the TCGA dataset (Fig. S4), the average AUC across the 1-, 3- and 5-year ROC curves was > 0.8.

Figure 3

(A–C) Relationship between different hub genes and survival (time is in days); (A) CCL5; (B) ETS2; and (C) IL2RA; (D–F) construction of a risk-scoring model; (D) Forest plot of the univariate Cox regression analysis results; (E) LASSO regression lambda coefficient determination. (F) Gene regression coefficients for model construction.

Figure 4

Validation of the model using external datasets (A,B) and clinical samples (C).

Validation of risk genes

The expression levels of several risk genes were evaluated using qRT-PCR in eight resistant and eight sensitive AML samples, both fresh and frozen, sourced from bone marrow and peripheral blood, with four healthy blood donors as controls. The analysis showed differential expression of ETS2 between the two groups (p = 0.0097), while no significant differences were observed in CCL5 and IL2RA expression levels (Fig. S5). Clinical data from 16 AML patients were also collected and analyzed through Kaplan–Meier survival analysis, revealing a statistically significant difference in overall survival. Patients with higher expression levels of ETS2, along with CCL5 and IL2RA, exhibited poorer survival outcomes (Fig. 4C).

Discussion

Acute Myeloid Leukemia (AML) is a heterogeneous disease with a high mortality rate and poor prognosis. Even though advancements in AML treatments, such as chemotherapy, stem cell transplantation, and drug-targeted therapies, have significantly improved AML prognosis, they still cannot achieve a complete curative effect. Furthermore, the recent development and clinical application of targeted drugs25 have opened up new avenues for AML treatment. However, more research efforts should be devoted to discovering novel potential targets to lay a more robust theoretical foundation for drug development.

With the advent of single-cell sequencing, researchers can now dissect cellular heterogeneity through genomic, epigenomic, transcriptomic, and proteomic analyses. Furthermore, technological advances have contributed to a deeper understanding of the molecular mechanism of healthy and malignant hematopoiesis15. Due to the remarkable heterogeneity among AML patients, some patients still face the challenge of recurrence and require treatments such as allogeneic HSCT, despite the availability of several mutation-based prognosis stratification systems. Notably, adverse treatment effects may also make some patients unsuitable for surgery. In this regard, the continued exploration of genetic models that may influence the prognosis of leukemia is critical, especially for identifying high-risk patients. Moreover, the use of single-cell sequencing to correlate patients’ cellular heterogeneity with prognosis, assess prognosis, and explore therapeutic targets for the disease is extremely valuable in clinical practice (Tables 1, 2, 3).

Table 1 Summary of clinical information of data set samples from TCGA.
Table 2 Summary of single cell dataset samples.
Table 3 Corresponding relationship between cluster and celltype.

The TME is critically involved in tumor occurrence and progression, drug tolerance, and immune infiltration26. Whereas immunocytes inhibit the tumor development process, tumor cells secrete immunomodulatory factors, which constantly remodel the microenvironment, leading to changes that favor tumor growth and invasion. Intercellular communication in the TME is mainly mediated by soluble or membrane-bound ligand–receptor interactions. Intercellular communication in mouse tumor model microenvironments for all cell types, including melanoma and breast and lung cancers, has recently been investigated through scRNA-seq27. Although the TME in AML has been recognized for many years, its critical involvement in disease development, progression, recurrence, and treatment resistance has only recently gained widespread attention28. Multiple AML prognosis models have also been proposed in different studies to assess prognosis, but their accuracy still needs to be improved. In a bone marrow immune microenvironment study29, single-cell analysis identified a cohort of adult and pediatric patients with upregulated inflammatory genes in malignant cells and obtained an inflammation-associated gene score (iScore), which was associated with poor survival outcomes in AML patients.

Research on the development of AML prognostic scoring models has been extensive, with numerous models being proposed. In a recent study, Wang30 constructed prognostic models focused on m6A-related genes using public databases and qRT-PCR. In contrast, our research is centered on the differential gene expression among immune cell clusters at the single-cell level, pertinent to the immune microenvironment. We validated our prognostic model using two independent external datasets, and the expression of characterized genes was confirmed through quantitative reverse transcription-polymerase chain reaction, benefiting from a large sample size and enhanced accuracy. In another approach, Wang31 employed RNA-seq data from public databases for WGCNA analysis and developed a risk prognosis model using Cox and LASSO regression to predict AML patient outcomes and construct a nomogram. However, this model lacked experimental validation and requires accuracy enhancements. Our model, validated through multiple dimensions, exhibited high accuracy. Distinctively, our study leverages a single-cell perspective to link immune-related genes, creating a prognostically significant model validated across various datasets and clinical samples.

Herein, ten prognosis-related genes were identified after analyzing the immunocytes of AML patients. Based on the risk values, TGFB1, SOCS1, RFXANK, IL2RA, ETS2, and CCL5 were related to poor prognosis, whereas TWIST1, MYB, MMP14, and ABCA1 were related to good prognosis, with CCL5 and ETS2 identified as risk factors of AML prognosis (HR > 1).

The C–C Chemokine Ligand 5 (CCL5, previously known as RANTES) is a physiological regulatory factor of immunocyte migration that has been identified to constitute a distinct chemokine-releasing cluster in AML32. It promotes tumor development in several ways by binding to the CCR5 receptor as a growth factor, thereby stimulating angiogenesis, modulating the ECM, inducing the recruitment of additional stromal and inflammatory cells, and regulating the immune evasion mechanism33. Additionally, recent research demonstrated that CCL5 mediates resistance to FLT3-TKIs in AML with FLT3-ITD mutations, potentially serving as a biomarker for drug resistance prediction34. Studies suggest that the v-ets avian erythroblastosis virus E26 oncogene homologue 2 (ETS2), a downstream target of the Ras/Raf/MAP kinase and phosphatidylinositol 3-kinase/Akt pathways, is one of the pioneering members of the E26 Transformation-Specific (ETS) family and is found on human chromosome 21 (ref. 35). Several studies agree that E26 overexpression is closely associated with poor prognosis in AML36,37,38,39. The ETS Proto-Oncogene 2 (ETS2) encodes a TF that regulates a series of genes involved in development and apoptosis, and studies have recognized its value in AML prognosis40. Fu et al. reported that the high ETS2 expression group exhibited a significantly lower OS, Event-Free Survival (EFS), and Relapse-Free Survival (RFS) compared to the low ETS2 expression group.

Herein, ETS2 showed good predictive value in the proposed prognosis model. As a result, ETS2 was selected, and its expression levels in clinical samples were verified. The PCR results revealed that ETS2 expression differed between healthy and AML samples and differed significantly between drug-resistant and sensitive patients. In this regard, high ETS2 expression predicted a poor prognosis of AML and was correlated with chemotherapy resistance and refractory recurrence. In other words, high ETS2 expression correlated negatively with OS, a finding that was consistent with previous research40.

Furthermore, our AML prognosis model is built on immunocyte infiltration, an approach that is relatively rare in the current literature, as most AML studies construct prognosis models based on gene mutations. Our findings were also subjected to a robust validation process in which the conclusions obtained in the training set were validated through analysis in two external validation datasets and an in-depth investigation of typing and prognosis at various levels.

Nevertheless, there are several drawbacks to this study. First, the sequencing data were obtained from public databases, and additional studies involving larger clinical cohorts would be needed to verify the accuracy of the proposed prognosis model. Second, WGCNA analysis of variables in the multi-factor network (1829 genes) was not performed during the study.

Conclusions

Three biomarkers (CCL5, ETS2, and IL2RA) of poor prognosis, as well as the possible pathways and mechanisms through which they may play a role in AML evolution, were explored. A prognostic risk-scoring model was then constructed and validated in other independent external datasets and in vitro experiments. Despite the valuable insights this study adds to existing literature, more large-scale randomized controlled clinical trials and in vivo/in vitro experiments are required to further validate the clinical utility of the three hub genes and the proposed prognostic risk-scoring model.