Introduction

Breast cancer (BC) is a multifactorial, heterogeneous disease characterized by uncontrolled cell proliferation1,2. It is the most prevalent cancer type, primarily affects women, and places a heavy burden on public and individual health spending. Breast cancer accounts for nearly 38.9% of all human cancer types. GLOBOCAN estimates for 2022 attribute 11.6% of all new cancer cases to female breast cancer, and almost 6.9% of cancer deaths3. The prevalence of breast cancer in Asia is about 40%4, and Pakistan reported about 1.38 million cases of breast cancer in 20155. Multiple genetic, hormonal and environmental factors contribute to breast cancer. The disease mostly affects females and can originate from germline mutations. Key genes implicated in breast cancer include BRCA1/BRCA26, TP537, PTEN8, STK119 and CDH110,11.

The complexity and diversity of BC subtypes make it difficult to study the pathways and risk factors underlying the onset of the disease. A comprehensive understanding of the pathways responsible for its onset and proliferation is therefore needed; it also implies that the genes involved in these pathways could be used for prevention, early detection, and personalized treatment approaches.

Aberrations in the expression of estrogen receptor (ER), progesterone receptor (PR) and human epidermal growth factor receptor 2 (HER2) have often been associated with three distinct subtypes of breast cancer, which are observed both clinically and in molecular expression profiles. ER+ breast cancer is of considerable significance because of several factors affecting its diagnosis, prognosis, and treatment. ER+ breast cancer typically responds well to endocrine therapy in about 70% of cases12.

Triple negative breast cancer (TNBC), on the other hand, is a type of BC whose molecular characteristics differ from those of the aforementioned types: it lacks expression of the estrogen receptor (ER), progesterone receptor (PR), and human epidermal growth factor receptor 2 (HER2)13,14. Approximately 15–20% of all diagnosed BCs fall into the TNBC category15. Compared with other tumor types, TNBC tumors are aggressive, with a poor prognosis and high recurrence rates16,17,18,19,20,21,22,23. Therefore, accurate identification of networks of differentially expressed genes (DEGs) is needed for a comprehensive understanding and distinct characterization of the various breast cancer types.

This study aims to identify potential hub genes that contribute to the development and progression of both ER+ BC and TNBC. To determine the molecular basis of the biological differences, integrated bioinformatics analyses were performed, including classification of the BC types with machine learning models. DEG analyses were performed to delineate the transcriptomic profiles uniquely associated with the ER+ and TNBC types on the basis of LogFC and P values; finally, hub genes that might serve as biomarkers for the disease were identified for both types of BC. Our findings will contribute to a better understanding of the distinct phenotypes associated with ER+ and TNBC oncogenesis, and to the development of novel diagnostic and therapeutic alternatives against the disease.

Materials and methods

RNA-Seq datasets of ER+ and TNBC patients were retrieved from ArrayExpress. The datasets were quality checked and aligned, duplicate reads were removed, and differentially expressed genes were identified on the Galaxy suite24. DAVID and Cytoscape were employed to analyze the pathways and networks associated with the disease and to determine which genes are involved in the pathogenesis of breast cancer25,26. Machine learning classifiers, namely Support Vector Machine, Naïve Bayes and k-Nearest Neighbor, were employed to generate a classification model that distinguishes the two BC subtypes.

Dataset description

ER+ and TNBC RNA-Seq datasets were obtained from the ArrayExpress27 repository, a curated database for high-throughput sequencing data. The datasets used in the study are E-GEOD-58135, E-MTAB-4993 and E-GEOD-45419; they are described in Table 1. ArrayExpress is linked to the European Nucleotide Archive (ENA), a nucleotide database that provides nucleotide sequencing data, sequence assembly information, and functional annotations. The datasets were uploaded to the Galaxy server (https://usegalaxy.eu/) for processing via ENA28. An overview of the processes employed in the study is presented in Fig. 1.

Table 1 Datasets description.
Figure 1

Data processing and analysis workflow.

Data pre-processing

The samples were pre-processed with FASTQC and FASTQ Groomer. HISAT2 was used for alignment because of its high efficiency. “MarkDuplicates” was used to identify duplicate reads in the SAM file by comparing the 5′ positions of reads and read pairs. Afterwards, “RmDup” was used to remove the duplicate reads. Expression was quantified with “featureCounts”, which counts the reads that map to genomic features and can be applied to both RNA and DNA sequencing data.
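The principle of position-based duplicate marking can be sketched in a few lines of Python. This is an illustration only and not the implementation used by MarkDuplicates or RmDup: the `Read` record, its fields and the use of mapping quality to choose the retained read are assumptions made for the example, and the real tools also take mate positions and base qualities into account.

```python
from collections import defaultdict
from typing import NamedTuple


class Read(NamedTuple):
    """Hypothetical minimal alignment record (not a real SAM parser)."""
    name: str
    chrom: str
    five_prime: int  # 5' coordinate of the aligned read
    strand: str      # "+" or "-"
    mapq: int        # mapping quality; assumed tie-breaker for this sketch


def mark_duplicates(reads):
    """Return the names of reads flagged as duplicates.

    Reads sharing chromosome, strand and 5' position form one group;
    the read with the highest mapping quality is kept, the rest are flagged.
    """
    groups = defaultdict(list)
    for read in reads:
        groups[(read.chrom, read.strand, read.five_prime)].append(read)

    duplicates = set()
    for group in groups.values():
        best = max(group, key=lambda r: r.mapq)
        duplicates.update(r.name for r in group if r is not best)
    return duplicates


def remove_duplicates(reads):
    """Drop flagged reads while preserving the original order."""
    flagged = mark_duplicates(reads)
    return [r for r in reads if r.name not in flagged]


# Invented toy reads, for illustration only.
reads = [
    Read("r1", "chr1", 100, "+", 60),
    Read("r2", "chr1", 100, "+", 30),  # same 5' position and strand as r1
    Read("r3", "chr1", 100, "-", 60),  # opposite strand, so not a duplicate
    Read("r4", "chr2", 100, "+", 60),
]
kept = remove_duplicates(reads)
print([r.name for r in kept])
```

Separating the marking step from the removal step mirrors the two-stage procedure described above, in which duplicates are first identified and only then discarded.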

Identification of differentially expressed genes

A quality-controlled, normalized dataset was used for supervised analysis comparing gene expression levels between ER+ and TNBC samples using DESeq2. Genes with a p value < 0.05 and a log fold change (LogFC) < −1 or > 1 were considered statistically significant.
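The significance filter can be written as a short Python sketch, assuming the DESeq2 results have already been exported as rows of gene name, LogFC and p value. The `GeneResult` type, the gene labels and the numbers are hypothetical and serve only to show the thresholding logic.

```python
from typing import NamedTuple


class GeneResult(NamedTuple):
    """Hypothetical row of a differential expression results table."""
    gene: str
    log_fc: float
    p_value: float


def is_significant(row, p_cutoff=0.05, lfc_cutoff=1.0):
    """Apply the two thresholds: p < 0.05 and |LogFC| > 1."""
    return row.p_value < p_cutoff and abs(row.log_fc) > lfc_cutoff


# Invented values, not taken from the study.
results = [
    GeneResult("GENE_A", 2.3, 0.001),    # passes, up-regulated
    GeneResult("GENE_B", -1.8, 0.020),   # passes, down-regulated
    GeneResult("GENE_C", 0.4, 0.0001),   # fails the LogFC threshold
    GeneResult("GENE_D", 3.1, 0.200),    # fails the p value threshold
]

degs = [r for r in results if is_significant(r)]
up_regulated = [r.gene for r in degs if r.log_fc > 0]
down_regulated = [r.gene for r in degs if r.log_fc < 0]
print(up_regulated, down_regulated)
```

Testing the absolute value of LogFC makes explicit that a gene qualifies when its change is below −1 or above 1, in either direction.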

Machine learning

Expression data were used to build models with supervised machine learning classifiers. Three supervised ML classifiers, SVM29, Naïve Bayes30 and kNN30, were trained, and the accuracy, sensitivity and specificity of each model were evaluated.
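For reference, the three performance measures follow directly from the confusion matrix of a binary classifier. The sketch below uses hypothetical counts for a 32-sample test set; they are not the study's results, and which subtype is treated as the positive class is an assumption that has to be fixed before the measures are interpreted.

```python
def classification_metrics(tp, fn, tn, fp):
    """Compute accuracy, sensitivity and specificity from confusion-matrix counts.

    tp/fn: positive-class samples predicted correctly/incorrectly.
    tn/fp: negative-class samples predicted correctly/incorrectly.
    """
    total = tp + fn + tn + fp
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
    }


# Hypothetical counts chosen for illustration only.
metrics = classification_metrics(tp=18, fn=2, tn=9, fp=3)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Reporting sensitivity and specificity alongside accuracy matters when the two classes are unevenly represented, because accuracy alone can hide poor performance on the smaller class.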

Functional enrichment analysis

The Database for Annotation, Visualization and Integrated Discovery (DAVID) (https://david.ncifcrf.gov/) was used for the functional annotation of GO terms and for KEGG pathway enrichment analysis. DAVID is a widely used resource for evaluating the functional significance of quantitative gene expression profiles25. GO term analysis (molecular function and biological process) and pathway enrichment analysis were performed for the candidate DEGs; terms with a p-value < 0.05 were considered significant. The online tool REVIGO (available online: http://revigo.irb.hr/) was used to summarize and visualize long lists of GO terms31. The GO terms were clustered and represented in a scatter plot using a semantic similarity measure.
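The statistical idea behind term enrichment can be illustrated with a plain hypergeometric over-representation test. This is a simplified sketch rather than DAVID's own procedure, which reports a more conservative modified Fisher's exact test (the EASE score); the function name and the counts in the example are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(k, n, K, N):
    """One-sided over-representation p-value, P(X >= k).

    N: number of background genes
    K: background genes annotated with the term
    n: genes in the query list (e.g. DEGs)
    k: query genes annotated with the term
    """
    upper = min(n, K)
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


# Hypothetical counts: 40 of 1730 query genes carry a term that annotates
# 200 of 20000 background genes.
p = hypergeom_enrichment_p(k=40, n=1730, K=200, N=20000)
print(f"enrichment p-value: {p:.3g}")
print("significant at 0.05:", p < 0.05)
```

Exact integer arithmetic with `math.comb` avoids the rounding problems that arise when very small tail probabilities are accumulated in floating point.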

Network analysis

To evaluate the interactive relationships among DEGs, STRING (available online: https://string-db.org/) was used to construct a protein-protein interaction (PPI) network32,33. The cutoff was set to an interaction confidence score of > 0.4 to eliminate inconsistent interactions, yielding a PPI network restricted to interactions of at least medium confidence. The STRING results were then imported into Cytoscape34 to visualize the PPIs of statistically significant DEGs35. Cytohubba was used to construct a sub-network of hub genes based on the maximal clique centrality (MCC) algorithm, in which molecular species are represented as nodes and their intermolecular interactions as links or edges between those nodes36. The thickness of an edge represents the affinity of the interaction: the thicker the line, the stronger the interaction.
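The MCC score of a node is the sum of (|C| − 1)! over all maximal cliques C that contain it, so a node whose neighbours are not connected to one another simply scores its degree. The sketch below, which is not the Cytohubba implementation, filters a scored edge list at 0.4, enumerates maximal cliques with the Bron-Kerbosch algorithm and accumulates the scores. The node labels and interaction scores are invented for illustration.

```python
from collections import defaultdict
from math import factorial


def build_graph(scored_edges, min_score=0.4):
    """Keep only edges whose confidence score exceeds min_score."""
    graph = defaultdict(set)
    for a, b, score in scored_edges:
        if score > min_score and a != b:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def maximal_cliques(graph):
    """Enumerate maximal cliques (Bron-Kerbosch with pivoting)."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(r)
            return
        # Pivot on the vertex with most neighbours in p to prune branches.
        pivot = max(p | x, key=lambda v: len(graph[v] & p))
        for v in list(p - graph[pivot]):
            expand(r | {v}, p & graph[v], x & graph[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(graph), set())
    return cliques


def mcc_scores(graph):
    """MCC(v) = sum of (|C| - 1)! over maximal cliques C containing v."""
    scores = dict.fromkeys(graph, 0)
    for clique in maximal_cliques(graph):
        weight = factorial(len(clique) - 1)
        for node in clique:
            scores[node] += weight
    return scores


# Hypothetical scored interactions (node labels and scores are invented).
edges = [
    ("G1", "G2", 0.9), ("G1", "G3", 0.8), ("G2", "G3", 0.7),
    ("G3", "G4", 0.6), ("G4", "G5", 0.3),  # last edge falls below the cutoff
]
scores = mcc_scores(build_graph(edges))
ranked = sorted(scores, key=scores.get, reverse=True)
print(scores, "top hub:", ranked[0])
```

In this toy network G3 belongs to a triangle (contributing 2! = 2) and to one additional edge (contributing 1! = 1), which gives it the highest score of 3.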

Expression of hub genes

The cancer data analysis portal UALCAN (https://ualcan.path.uab.edu/analysis.html), a web-based tool for analyzing gene expression and clinical data from The Cancer Genome Atlas (TCGA), was used to generate box and whisker plots showing hub gene expression levels in different cancers and their subtypes at various sub-stages37. CDK1, CDC20, CDCA8, RRM2, NDC80, CEP55, CENPF, BUB1, TTK and AURKA were significantly overexpressed in breast cancer tissues, stratified by menopause status, compared with normal tissues38.

Ethics approval and consent to participate

We further confirm that no aspect of the work covered in this manuscript involved human patients; it therefore requires no ethical approval from any relevant body.

Results

Result of differential expression

The raw reads were aligned against Hg38Chr using HISAT2; duplicates were identified and removed using MarkDuplicates and RmDup, respectively. The R package DESeq2 was employed to identify the differentially expressed genes from the count files produced by featureCounts. DESeq2 generated a histogram, an MA plot and a PC plot for each dataset (Figs. 2–4). The DEGs common to the three RNA-Seq datasets were obtained with the Venny tool (https://bioinfogp.cnb.csic.es/tools/venny/)39; 1730 overlapping genes were identified among the three datasets (Fig. 5).
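The overlap that a Venn diagram tool reports for several gene lists is a set intersection. A minimal Python sketch is given here; the dataset accessions are those used in the study, but the gene labels are placeholders and not actual DEGs.

```python
def common_degs(*deg_lists):
    """Return the genes present in every supplied DEG list."""
    sets = [set(genes) for genes in deg_lists]
    return set.intersection(*sets) if sets else set()


# Placeholder gene labels, for illustration only.
degs_by_dataset = {
    "E-GEOD-58135": ["GENE_A", "GENE_B", "GENE_C", "GENE_D"],
    "E-MTAB-4993": ["GENE_B", "GENE_C", "GENE_E"],
    "E-GEOD-45419": ["GENE_C", "GENE_B", "GENE_F"],
}

overlap = common_degs(*degs_by_dataset.values())
print(sorted(overlap))
```

Converting each list to a set first removes repeated identifiers, so a gene listed twice in one dataset does not affect the result.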

Figure 2

PC plot (A), dispersion estimates (B), histogram (C) and MA plot (D) created by the DESeq2 tool for the E-GEOD-45419 dataset. (A) The PC plot shows the two phenotypes, ER+ and TN, grouped on the basis of expression. (B) Dispersion estimates quantify the variability of gene expression across samples. Black dots are the gene-wise dispersion estimates, the red line is the trend fitted to them, and blue dots are the final estimates after shrinkage toward that trend. Genes with low dispersion are relatively stably expressed, whereas genes with high dispersion are more variable. (C) The histogram shows the frequency of genes grouped into bins. (D) The MA plot shows the differences between ER+ and TN measurements as log ratios plotted against mean averages. Red marks differentially expressed genes, whereas grey marks genes showing no variation.

Figure 3

PC plot (A), dispersion estimates (B), histogram (C) and MA plot (D) created by the DESeq2 tool for the E-MTAB-4993 dataset. (A) The PC plot shows the two phenotypes, ER+ and TN, grouped on the basis of expression. (B) Dispersion estimates quantify the variability of gene expression across samples. Black dots are the gene-wise dispersion estimates, the red line is the trend fitted to them, and blue dots are the final estimates after shrinkage toward that trend. Genes with low dispersion are relatively stably expressed, whereas genes with high dispersion are more variable. (C) The histogram shows the frequency of genes grouped into bins. (D) The MA plot shows the differences between ER+ and TN measurements as log ratios plotted against mean averages. Red marks differentially expressed genes, whereas grey marks genes showing no variation.

Figure 4

PC plot (A), dispersion estimates (B), histogram (C) and MA plot (D) created by the DESeq2 tool for the E-GEOD-58135 dataset. (A) The PC plot shows the two phenotypes, ER+ and TN, grouped on the basis of expression. (B) Dispersion estimates quantify the variability of gene expression across samples. Black dots are the gene-wise dispersion estimates, the red line is the trend fitted to them, and blue dots are the final estimates after shrinkage toward that trend. Genes with low dispersion are relatively stably expressed, whereas genes with high dispersion are more variable. (C) The histogram shows the frequency of genes grouped into bins. (D) The MA plot shows the differences between ER+ and TN measurements as log ratios plotted against mean averages. Red marks differentially expressed genes, whereas grey marks genes showing no variation.

Figure 5

The Venn diagram shows the 1730 DEGs common to the three datasets.

Classification outcomes

Classification models were built to differentiate BC samples based on the DEGs identified by the DESeq2 tool. The SVM, Naïve Bayes and kNN classification algorithms were applied to a training dataset of 134 samples and a test dataset of 32 samples. In the validation stage, kNN reached the highest accuracy (84%), Naïve Bayes reached 81%, and SVM the lowest (71%). All samples were classified by the models; the results are shown in Fig. 6 and Table 2.
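As an illustration of how a kNN classifier assigns a subtype label, the following Python sketch implements majority voting among the nearest training samples using Euclidean distance. It is a minimal example and not the pipeline used in the study: the two-gene expression vectors are invented, and the choice of k = 3 and of the distance measure are assumptions.

```python
from collections import Counter
from math import dist


def knn_predict(train_x, train_y, sample, k=3):
    """Label a sample by majority vote among its k nearest training points."""
    by_distance = sorted(zip(train_x, train_y), key=lambda pair: dist(pair[0], sample))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Invented expression values for two hypothetical genes per sample.
train_x = [(1.0, 5.0), (1.2, 4.8), (0.8, 5.2), (5.0, 1.0), (4.8, 1.2), (5.2, 0.9)]
train_y = ["ER+", "ER+", "ER+", "TNBC", "TNBC", "TNBC"]

test_x = [(1.1, 5.1), (4.9, 1.1)]
predictions = [knn_predict(train_x, train_y, s) for s in test_x]
print(predictions)
```

An odd value of k avoids tied votes in a two-class problem; with real expression data the features would normally be scaled first so that no single gene dominates the distance.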

Figure 6

Results of SVM (A), Naïve Bayes (B) and kNN (C) respectively.

Table 2 Classifier’s results.

Pathway analysis

Gene enrichment analysis and KEGG pathway analysis of the 1730 common DEGs were carried out with the DAVID tool. The biological processes (BP), molecular functions (MF) and cellular components (CC) obtained are shown in Tables 3–5, respectively. The genes were involved in different biological processes, including mammary gland alveolus development (GO:0060749), response to drug (GO:0042493), natural killer cell mediated cytotoxicity (GO:0042267), regulation of insulin secretion (GO:0050796), peripheral nervous system development (GO:0007422), cAMP-mediated signaling (GO:0019933), and regulation of cell growth (GO:0001558), as detailed in Table 3. The GO molecular function analysis revealed the involvement of DEGs in phosphatidylinositol phospholipase C activity (GO:0004435), mRNA 5’ UTR binding (GO:0048027), and calcium ion binding (GO:0005509) (Table 4). In addition, CC group genes were mainly enriched in the extracellular space (GO:0005615), basolateral plasma membrane (GO:0016323), and extracellular region (GO:0005576) (Table 5). Furthermore, we classified the DEGs associated with different biological pathways according to the KEGG reference database using DAVID (P < 0.05; FDR < 0.05). The KEGG pathway analysis showed the association of the DEGs with the cell cycle, insulin secretion, pathways in cancer and prostate cancer (Table 6). REVIGO was used to visualize the gene ontology terms as a scatter plot (Fig. 7). The plot depicts the semantic similarity between GO terms on the x-axis, whereas the y-axis indicates the p-value, or significance. Terms that lie close together on the x-axis are functionally closely related, and terms positioned higher on the y-axis have lower p-values and hence greater significance. Colors in the scatter plot can represent GO hierarchies.

Table 3 Biological processes in which genes are involved.
Table 4 Molecular Functions in which genes are involved.
Table 5 Cellular Components in which genes are involved.
Table 6 KEGG Pathways of DEGs.
Figure 7

The Scatterplot represents the cluster representatives (i.e. terms remaining after the redundancy reduction) in a two-dimensional space derived by applying multidimensional scaling to a matrix of the GO term semantic similarities.

Network analysis

In the network analysis, the gene interaction network was constructed with STRING and visualized with Cytoscape. The network built for the 1730 differentially expressed genes consisted of 1505 nodes and 9714 edges (Fig. 8); an interaction between two nodes indicates a correlation between them. In Cytohubba, the MCC algorithm measures the centrality of nodes by analyzing their involvement in large cliques. Identifying hub genes helps to characterize the structure and connectivity of a network. The top 10 hub genes identified were CDC20, CDK1, BUB1, AURKA, CDCA8, RRM2, TTK, CENPF, CEP55 and NDC80; the network involving these genes is shown in Fig. 9.

Figure 8

Network of differentially expressed genes constructed with STRING. Thick lines indicate significant association, functional similarity or co-regulation between genes, while thin lines represent low-level interactions. Genes connected by thin lines still exhibit some level of association, but its significance is relatively low.

Figure 9

Top 10 hub genes identified with the MCC algorithm.

Hub genes expression analysis

Transcriptional and translational expression levels of all hub genes were significantly higher (P = 0.05) in cancerous tissues than in normal tissues. Furthermore, when stratified by patient menopause status, hub gene expression levels were significantly higher in breast cancer samples than in normal samples in patients at different cancer stages, as shown by the box and whisker plots in Fig. 10.

Figure 10

Box and whisker plots of the expression profiles of the ten hub genes at various menopausal stages, showing statistically significant differences in premenopausal, perimenopausal and postmenopausal patients compared with normal controls, based on data from The Cancer Genome Atlas (TCGA) database.

Discussion

In this study, three RNA-Seq datasets comprising ER+ and TNBC samples were pre-processed, aligned, screened and filtered for duplicates, and finally processed to calculate expression counts. In this way 1730 overlapping DEGs were identified, which served as the training and test data for classification models intended to identify transcriptomic patterns that may help differentiate between ER+ and TNBC. The DEGs of ER+ and TN samples were filtered on the basis of logFC and p-values. Pathway and network analyses of the selected DEGs were performed with DAVID25 and Cytoscape26. Classification models based on three different algorithms were built to differentiate between the ER+ and TNBC types, and the accuracy, sensitivity and specificity of the classifiers were estimated. The kNN classifier exhibited the highest accuracy, 84%, compared with 72% for SVM and 81% for Naïve Bayes. Thus, kNN was found to be the best classifier for distinguishing between the ER+ and TNBC types.

The DEGs were identified across three RNA-Seq datasets, and three classification models, Support Vector Machine (SVM), Naïve Bayes, and k-Nearest Neighbors (kNN), were built to distinguish between ER+ and TNBC samples, a distinction that is clinically extremely important for diagnosis and for the choice of therapeutic alternatives. Misdiagnosis of TNBC, i.e. false-negative TNBC cases mistakenly diagnosed as ER+, and vice versa, often leads to clinical complications. We therefore trained ML classifiers on the aforementioned DEG data to arrive at a protocol that could help improve the current methodology for BC. To evaluate the effectiveness of each model in correctly discriminating between ER+ and TNBC cases and minimizing false positives, performance metrics such as accuracy, sensitivity and specificity were used. This analysis not only elucidates the key molecular signatures that could serve to discriminate ER+ from TNBC but also underscores the utility of machine learning methodologies in enhancing the accuracy of BC diagnosis.

The GO analyses (CC, MF and BP) showed that the overlapping DEGs were primarily enriched in the extracellular space and are associated with the cell cycle, positive regulation of cell proliferation, cAMP-mediated signaling, transcription factor binding, sequence-specific DNA binding and calcium ion binding. In addition, the KEGG pathway enrichment analysis indicated that these overlapping DEGs were significantly enriched in pathways in cancer, the cAMP signaling pathway, cell cycle, oocyte meiosis, the estrogen signaling pathway, the p53 signaling pathway and the calcium signaling pathway. These enriched gene functions and KEGG pathways provide insights into the molecular mechanism of ER+ and TNBC progression. Our analyses led to the inference that CDC20, CDK1, BUB1, AURKA, CDCA8, RRM2, TTK, CENPF, CEP55, and NDC80 serve as hub genes in the progression of ER+ and TNBC and may also predict poor survival of BC patients. As illustrated by the TCGA analysis, the ten hub genes were consistently overexpressed (p < 0.05) in breast cancer samples across multiple clinicopathological subgroups.

Previous studies have revealed that CDK1, BUB1, AURKA, CDCA8, RRM2, TTK, CENPF, CEP55 and NDC80 are implicated in the cell cycle and associated with tumorigenesis. CDK1, also known as CDC2, is involved in the precise division of cells40. In the TNBC clinical subtype of breast cancer, inhibiting CDK1 expression can suppress tumor cell growth and induce apoptosis41. In addition, BUB1 is one of the key mitotic checkpoint genes, and its expression level is closely correlated with the proliferation of carcinoma cells42,43,44. RRM2, a breast cancer hub gene, has been found to be closely associated with tumor growth, invasion, angiogenesis and metastasis, as well as with the prognosis of patients with breast cancer45,46. Furthermore, the protein kinase TTK is capable of phosphorylating both serine and threonine residues. TTK plays a crucial role in cell division and is highly expressed in a wide variety of malignant tumors47.

Approximately 73% of patients with breast cancer overexpress Aurora kinase A (AURKA), a kinase essential to cell division and particularly to chromosome segregation during mitosis48,49. AURKA plays an important role in spindle assembly, centrosome maturation, and chromosome alignment49. Breast cancer development is negatively affected by the overexpression of AURKA. Similarly, CDCA8, also known as cell division cycle associated 8, is a part of the chromosomal passenger complex. It plays a crucial role in mitosis by regulating chromosome alignment and segregation at the centromeres50. Centromere protein F (CENPF) has previously been reported to be a marker of cell proliferation in several human malignancies, including breast cancer51,52. Centrosome protein 55 (CEP55) is an important microtubule-binding protein that is located in the centrosome of interphase cells and in the midbody of metaphase cells. CEP55 is overexpressed in several cancer types, such as colon, lung, and breast cancer53. NDC80, CDK1, and CCNB1 have been shown to play key roles in breast cancer pathophysiology, such as regulating the growth and invasion of the cancer54. In accordance with our research, these hub genes might serve as potential biomarkers for the early-stage diagnosis and prognosis of ER+ and TNBC breast cancer. Thus, aberrations in their expression level (logFC) can be associated with the onset of breast cancer. As a consequence of this inference, we also sought to develop ML models that could distinguish the RNA-Seq profile of an ER+ or TNBC affected individual from that of normal healthy individuals, as our datasets include patients in both the early and metastatic stages of BC. Seven of the ten hub genes identified in the study, CDK1, CDC20, CEP55, CENPF, BUB1, TTK and AURKA, have been associated with the ER+ immune signature in various studies, but they have not yet been researched for their association with TNBC. This study adds another three hub genes, CDCA8, RRM2 and NDC80, which may help to further refine the unique immune signature for ER+ and TNBC. Putatively, the ten hub genes identified here may also help revise the immune signatures for TNBC and distinguish it from the rest of the BC types.

Admittedly, these genes have previously been reported, on the basis of various gene association studies, to be immune signatures of TNBC. However, this study is the first of its kind to clearly illustrate that association on the basis of experimental evidence from transcriptomic datasets. The analysis also suggests that key features of the variation in expression of these hub genes may be associated with BC.

RNA-Seq analyses of three datasets comprising 134 samples also illustrate that these genes may serve as biomarkers or immune signatures distinctly for the ER+ and TNBC types. Therefore, we report not only the transcriptomic attributes associated with TNBC etiology but also a set of genes associated with another BC type, ER+. Our models, along with the identified hub genes, provide key features exclusively associated with both BC types.

Globally, breast cancer is one of the most prevalent cancers affecting women. In advanced stages, the disease can spread throughout the body via blood vessels and lymphatics, resulting in death directly caused by the disease. Despite the promising results of advanced therapies for controlling breast cancer prior to metastasis, the treatment of advanced-stage breast cancer remains a challenge. Therapies for preventing breast cancer recurrence and metastasis are also scarce. Hence, finding biomarkers that could help improve diagnostic strategies, monitor the metastasis of breast cancer, and clarify its peculiar mechanisms is of utmost importance.

Conclusion

The current study, involving three extensive datasets containing 134 ER+ and TNBC transcriptomes, led to the identification of 1730 differentially expressed genes uniquely associated with ER+ and TNBC individuals. The hub genes can serve as biomarkers for the diagnosis and/or prognosis of ER+ and TNBC patients. Pathway enrichment analysis and network analysis revealed the key signaling pathways in which these genes are implicated. Classification models based on SVM, Naïve Bayes and kNN were built on the datasets and ranked on the basis of accuracy, specificity and sensitivity. kNN was ranked as the best classifier, with a sensitivity of 95%, an accuracy of 84%, and a specificity of 66%. We demonstrated that transcriptome analysis integrated with ML classifiers can be used to improve the diagnosis of ER+ and TNBC patients.