Introduction

Despite improvements in diagnosis and therapy, breast cancer remains a leading cause of cancer death among women worldwide, presenting challenges for patients, healthcare professionals, and researchers. Customized treatment strategies require accurate prognosis [1]. Machine learning, a branch of artificial intelligence (AI), trains algorithms on large datasets to find patterns and make predictions, and it is increasingly used in medicine for prognosis and diagnosis. Applied to large patient datasets, these algorithms search for trends and factors relevant to breast cancer prognosis. For example, artificial neural networks that analyze gene expression in breast cancer cells can predict patient outcomes. By considering many prognostic factors at once, machine learning can predict breast cancer outcomes with greater precision and individualization, identify previously unknown prognostic factors, and help customize treatment strategies for individual patients.

Using machine learning algorithms on large patient datasets, researchers have predicted breast tumor outcomes based on tumor size and morphology. Such an algorithm can use patient age, tumor features, hormone receptor status, and lymph node involvement to predict outcomes and suggest tailored treatment [2]. Although promising, machine learning still faces obstacles in forecasting breast cancer prognosis. Data processing requires sophisticated computational infrastructure, and algorithm training requires large amounts of high-quality data. Medical professionals must also understand the algorithms for them to be useful in clinical settings.

To forecast the prognosis of breast cancer, it is necessary to evaluate factors such as tumour size, grade, hormone receptor status, lymph node involvement, and genetic alterations. Accurate prediction is needed to tailor therapy to each patient [3]. Multi-parameter prognostic tools have been developed, including the St. Gallen International Consensus Guidelines and the Nottingham Prognostic Index. Because they rely on arbitrary thresholds and cannot account for individual variation, these tools are limited.
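To make the idea of a fixed-formula prognostic tool concrete, the Nottingham Prognostic Index is commonly stated as NPI = 0.2 × tumour size (cm) + lymph node stage (1 to 3) + histological grade (1 to 3). The following is a minimal sketch of that formula only; the function and argument names are chosen here for illustration, and the risk-group cut-offs, which differ between sources, are deliberately omitted.

```python
def nottingham_prognostic_index(size_cm: float, node_stage: int, grade: int) -> float:
    """Return NPI = 0.2 * size_cm + node_stage + grade.

    size_cm: tumour diameter in centimetres.
    node_stage: lymph node stage, 1 to 3.
    grade: histological grade, 1 to 3.
    """
    if size_cm < 0:
        raise ValueError("size_cm must be non-negative")
    if node_stage not in (1, 2, 3) or grade not in (1, 2, 3):
        raise ValueError("node_stage and grade must each be 1, 2, or 3")
    return 0.2 * size_cm + node_stage + grade


# Hypothetical patient: 2 cm tumour, node stage 1, grade 2.
example_npi = nottingham_prognostic_index(2.0, 1, 2)
```

Every patient with the same three inputs receives the same score, which illustrates why such indices cannot reflect individual variation beyond the factors they encode.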

Machine learning is applied extensively in medicine, where it aids outcome prediction and disease diagnosis, particularly in breast cancer [4]. It analyzes large datasets to identify prognostic trends and uses gene expression data to forecast patient outcomes. In image analysis, it interprets MRI scans and mammograms and identifies patterns of cancerous tissue, which aids early detection and treatment planning for aggressive tumors [5]. In genetic data analysis, it identifies mutations that cause breast cancer, which facilitates tailored treatment for individual patients and simplifies prognostic models.

Machine learning’s ability to assess many patient characteristics at once is useful for building personalized treatment plans based on age, tumor size, and genetic anomalies. Identifying specific mutations allows for individualized therapy regimens that meet patient needs. Despite the benefits, problems persist. Large, high-quality datasets for algorithm training are hard to find, especially in regions where few breast cancer cases are recorded. The advanced computing infrastructure needed for data processing may be expensive or unavailable in some hospitals.

The significance of work:

This research aims to develop an early analytic model for breast cancer that facilitates timely prognosis and diagnosis by harnessing machine learning (ML) algorithms. The key stages of this study are:

  1. Employing various machine learning techniques, including the K-nearest neighbors (KNN) classifier, decision tree (DT), random forest (RF), Gaussian naive Bayes (GNB), and support vector classifier (SVC), among others, within the domain of breast cancer diagnosis.

  2. Constructing a diagnostic model using machine learning to enable swift detection and prognosis of breast cancer, ultimately assisting healthcare professionals in making informed decisions regarding patient care.

  3. Implementing K-Fold cross-validation to ensure the reliability and robustness of the findings, boosting the model’s usefulness and credibility in breast cancer detection and prognosis.

The paper is organized as follows. The Introduction outlines the significance and context of breast cancer research. The section “Review of literature” offers a detailed review of existing literature concerning breast cancer, and the section “Materials and methods” explains the materials and methodologies employed in this study. The section “Results of different machine learning algorithms or classifiers” presents the outcomes obtained with the different machine learning algorithms or classifiers. Finally, the paper closes with a discussion and conclusion summarizing the findings and their implications.

Review of literature

Breast cancer is the main cause of cancer death in women globally and a serious public health concern. Breast cancer detection and treatment have been somewhat successful, but machine learning techniques could greatly enhance accuracy. In this review of literature, we will explore the role of fuzzy logic and machine learning in improving breast cancer diagnosis and treatment. Medical imaging is a crucial tool for breast cancer diagnosis, but interpreting these images can be challenging, particularly in cases where the tumor is small, or the breast tissue is dense. Fuzzy logic has been proposed as a useful tool for improving the accuracy of breast cancer diagnosis using medical imaging. Fuzzy logic is a mathematical technique that can handle imprecise and uncertain information. It can be used to develop computer algorithms that can analyze medical imaging data and provide more accurate diagnoses.

A study by Jafari-Khouzani and El Naqa (2013) [6] explored the use of fuzzy logic in the diagnosis of breast cancer from mammography images. The authors developed a computer algorithm based on fuzzy logic that analyzed mammography images and provided a diagnosis of breast cancer. The algorithm was trained on a dataset of 143 mammography images and achieved a diagnostic accuracy of 85.3%.

Machine learning has also shown promise in analyzing medical imaging data for breast cancer diagnosis. Esteva et al. (2019) [7] examined deep learning systems for breast cancer diagnosis from digital pathology images. The authors developed a deep learning algorithm that analyzed digital pathology images of breast tissue and provided a diagnosis of breast cancer. The algorithm was trained on a dataset of 238,289 digital pathology images and achieved a diagnostic accuracy of 92.5%.

Another area where machine learning can be helpful in breast cancer research is the analysis of genetic data. Advances in genetics have led to the identification of several genetic mutations that can raise breast cancer risk. Machine learning algorithms can examine genomic data and identify patterns that are suggestive of such mutations. This information can then be used to develop individualised treatment strategies tailored to the requirements of each patient.

A study by Li and Chen [8] explored the use of machine learning algorithms to analyse genetic data related to breast cancer detection and treatment. They developed an algorithm that examined genetic information from patients with breast cancer in order to identify mutations associated with poor outcomes. After training on a dataset comprising 1,881 patients with breast cancer, the algorithm achieved a predictive accuracy of 70.9%.

Machine learning can also be helpful in the development of prognostic models for breast cancer (BC). Prognostic models are used to predict the likelihood of recurrence of breast cancer after treatment. Typically, these models take into account a number of variables, including lymph node involvement, tumour size, and grade. However, these models can be complex and difficult to interpret. Machine learning algorithms can be used to simplify these models and make them more easily understandable.

Lambertini et al. [9] examined the creation of breast cancer prognostic models using machine learning methods. Their method was designed to predict the likelihood of breast cancer recurrence after treatment by analysing patient data, including medical history, tumour characteristics, and treatment records. With 2,564 breast cancer patients as its training dataset, the system achieved a predictive accuracy of 72.1%.

Table 1 summarizes the findings from a review of literature on the role of fuzzy logic and machine learning in improving BC diagnosis and treatment. The table includes the author(s) and year of publication, the methodology used, the sample size, the data type, and the results of each study.

Other studies focused on analyzing genetic and clinical data. Li et al. developed an ML algorithm that analyzed genetic data from breast cancer patients and identified genetic mutations that were associated with poor prognosis. Nguyen et al. developed a machine learning algorithm that analyzed patient data to predict the likelihood of recurrence of breast cancer after treatment.

The studies also highlight the potential of fuzzy logic and machine learning in developing prognostic and predictive models. For example, Zhang et al. developed a fuzzy logic-based prognostic model that predicted overall survival of breast cancer patients with an accuracy of 75.6%.

Table 1 Review of literature

Materials and methods

In this study, we developed a diagnostic and prognostic model for breast cancer. Our approach breaks the process into stages, beginning with data acquisition. We then performed data preprocessing and finally used ML classifiers to assess the model’s performance, primarily by measuring the accuracy of BC predictions. The workflow is illustrated in Fig. 1.

Fig. 1 The workflow for implementing the suggested diagnostic model for breast cancer diagnosis

Data collection

In 1992, ML algorithms were trained on the Wisconsin Diagnostic Breast Cancer (WDBC) dataset [31]. That study used a digitized image of a breast mass obtained through fine needle aspiration (FNA) to derive the dataset parameters [20]. These features describe properties of the cell nuclei in the image [20]. The dataset contains 569 data points, 212 malignant and 357 benign. Its ten primary properties are radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, and fractal dimension. For each property, the dataset provides the mean, the standard error, and the “worst” value, obtained by averaging the three largest values [20]. Thus, the dataset comprises 30 attributes for analysis. Table 2 describes the dataset.
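The arithmetic behind the 30 attributes and the class balance can be checked with a short sketch. The column naming pattern (for example radius_mean) follows the names used later in this paper; the suffixes "se" and "worst" are an assumption made here for illustration and may differ from the actual file headers.

```python
# Ten base properties of the cell nuclei, as listed in the text.
BASE_PROPERTIES = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave points", "symmetry",
    "fractal dimension",
]
# Three statistics per property; the "se" and "worst" suffixes are assumed.
STATISTICS = ["mean", "se", "worst"]


def build_feature_names(properties=BASE_PROPERTIES, statistics=STATISTICS):
    """Return one column name per (statistic, property) pair."""
    return [
        f"{prop.replace(' ', '_')}_{stat}"
        for stat in statistics
        for prop in properties
    ]


feature_names = build_feature_names()

# Class counts reported in the text.
n_malignant, n_benign = 212, 357
n_total = n_malignant + n_benign
malignant_share = n_malignant / n_total

print(len(feature_names), n_total, round(malignant_share, 3))  # 30 569 0.373
```

Roughly 37% of the samples are malignant, so the classes are moderately imbalanced, which is worth keeping in mind when reading accuracy figures.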

Table 2 Dataset

Data preprocessing

In the machine learning pipeline, the “data preprocessing” step is the most crucial. Data preparation transforms unprocessed data into processed (meaningful) data. Before the dataset can be used for analysis, it must be cleaned, standardized, and freed of noise; the data used in this study are shown in Table 3.

Table 3 Data used in study

Fig. 2 visualizes the data. This preprocessing task counts the distinct values of the categorical features in the dataset; here we are concerned with the malignant and benign categories.

In the BCPM, missing values in the dataset are addressed through a process of imputation, where the missing values for specific features are replaced with mean-derived values. This approach helps maintain the integrity of the dataset and ensures that the analysis is not compromised by missing data. Regarding the methodology used to divide the data for training and validation purposes, the BCPM employs the K-Fold cross-validation method. This technique involves dividing the dataset into k equal-sized parts or folds. The model is trained on k-1 folds and validated on the remaining fold. This process is repeated k times, with each fold serving as the validation set exactly once. By averaging the results from each iteration, the model’s performance is evaluated more reliably, enhancing its effectiveness and credibility in the context of breast cancer diagnosis and prognosis.
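The mean imputation described above can be sketched as follows. This is an illustrative standard-library version rather than the code used in the study; the helper name and the toy values are hypothetical, and missing entries are assumed to be encoded as NaN.

```python
import math


def impute_column_means(rows):
    """Replace NaN entries with the mean of the observed values in the same column."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if not math.isnan(row[j])]
        if not observed:
            raise ValueError(f"column {j} has no observed values")
        means.append(sum(observed) / len(observed))
    return [
        [means[j] if math.isnan(value) else value for j, value in enumerate(row)]
        for row in rows
    ]


# Toy table with two features and two missing entries (hypothetical values).
toy_rows = [
    [14.0, 20.0],
    [float("nan"), 18.0],
    [10.0, float("nan")],
]
filled_rows = impute_column_means(toy_rows)
```

When imputation is combined with K-Fold cross-validation, computing the column means on the training folds only and reusing them for the validation fold avoids leaking information from the validation data.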

Fig. 2 Data visualization

The relationship between the two classes (M and B) with respect to several parameters, including diagnosis, radius_mean, texture_mean, perimeter_mean, and area_mean, is depicted in Fig. 3.
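Feature relationships of this kind are often quantified with the Pearson correlation coefficient. The sketch below is an illustration only: the values are hypothetical, generated for idealized circular nuclei rather than taken from the WDBC data, and they show why size-related features such as radius, perimeter, and area tend to be strongly correlated.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values for idealized circular nuclei (not WDBC measurements).
radius = [1.0, 2.0, 3.0, 4.0]
perimeter = [2 * math.pi * r for r in radius]   # linear in the radius
area = [math.pi * r ** 2 for r in radius]       # quadratic in the radius

r_perimeter = pearson(radius, perimeter)
r_area = pearson(radius, area)
```

For these idealized shapes the radius and perimeter are perfectly correlated, and the radius and area nearly so, which suggests that such features carry largely redundant information.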

Feature selection

A crucial step in developing a prediction model for breast cancer is “feature selection.” By reducing the number of variables (or inputs), this strategy lowers processing requirements and sometimes improves model performance. We replace missing values for the affected dataset attributes with mean-derived values. The “fit and transform” technique is then used to standardise and normalise the data.
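The “fit and transform” pattern can be illustrated with a small z-score scaler. This is a sketch of the general idea, not the implementation used in the study: the class name is hypothetical, and the population standard deviation is assumed.

```python
import math


class StandardScalerSketch:
    """Learn per-column mean and standard deviation, then rescale to z-scores."""

    def fit(self, rows):
        n = len(rows)
        n_cols = len(rows[0])
        self.means = [sum(row[j] for row in rows) / n for j in range(n_cols)]
        self.stds = []
        for j in range(n_cols):
            variance = sum((row[j] - self.means[j]) ** 2 for row in rows) / n
            # A constant column has zero spread; dividing by 1.0 leaves it at 0.
            self.stds.append(math.sqrt(variance) or 1.0)
        return self

    def transform(self, rows):
        return [
            [(v - m) / s for v, m, s in zip(row, self.means, self.stds)]
            for row in rows
        ]

    def fit_transform(self, rows):
        return self.fit(rows).transform(rows)


# Toy training data (hypothetical values) and one new sample.
train_rows = [[1.0, 10.0], [3.0, 30.0]]
scaler = StandardScalerSketch()
scaled_train = scaler.fit_transform(train_rows)
scaled_new = scaler.transform([[2.0, 40.0]])
```

The statistics are learned once on the training data with fit and then reused unchanged for new samples with transform, so that training and validation data are placed on the same scale.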

Several features have extreme values, as seen in Fig. 3. These values must be retained in our analysis because careful inspection of the data made clear that they are not the result of errors.

Fig. 3 Features correlation

As noted above, “feature selection” is a crucial step in developing a prediction model for breast cancer. By reducing the number of variables (or inputs), this approach seeks to lower computational requirements and occasionally improves the overall performance of the model. We replace missing values for specific features in our dataset with mean-derived values. The “fit and transform” technique is then used to standardise and normalise the data.

Results of different machine learning algorithms or classifiers

Results

Logistic regression, support vector machines, random forests, and decision trees are among the machine learning classifiers evaluated here. The data are divided into ten equal-sized parts for classification using k-fold cross-validation. This yielded the mean value in Table 4 after five iterations.
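The k-fold procedure can be sketched with the standard library alone. The example below is illustrative and makes several assumptions: it uses synthetic two-class data rather than the WDBC data, a simple nearest-centroid classifier as a stand-in for the classifiers named above, and hypothetical function names.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs; fold sizes differ by at most one."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx


def fit_centroids(X, y):
    """Compute the mean feature vector of each class."""
    centroids = {}
    for label in set(y):
        members = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def predict_centroid(centroids, x):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    return min(
        centroids,
        key=lambda label: sum((a - b) ** 2 for a, b in zip(x, centroids[label])),
    )


def cross_val_accuracy(X, y, k=10, seed=0):
    """Train on k-1 folds, score on the held-out fold, and average over k rounds."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(X), k, seed):
        model = fit_centroids([X[i] for i in train_idx], [y[i] for i in train_idx])
        hits = sum(predict_centroid(model, X[i]) == y[i] for i in test_idx)
        scores.append(hits / len(test_idx))
    return sum(scores) / len(scores), scores


# Synthetic, well-separated two-class data (not the WDBC data).
rng = random.Random(42)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(50)]
X += [[rng.gauss(4, 1), rng.gauss(4, 1)] for _ in range(50)]
y = ["B"] * 50 + ["M"] * 50

mean_accuracy, fold_scores = cross_val_accuracy(X, y, k=10)
```

Each sample is used for validation exactly once, and the mean over the ten fold scores is the figure that would be reported for a classifier.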

Table 4 Sample cross-validation for logistic regression

Having implemented cross-validation for logistic regression, we now apply the same procedure to the other ML classifiers; the results are shown in Table 5.

Table 5 Mean cross-validation scores for the selected models

Some models produce perfect scores, which indicates that overfitting occasionally occurs. Overfitting happens when a machine learning model performs well on training data but struggles to generalise its predictions to new, unseen data [29]. In other words, the model learns the training data so closely that it captures the noise or random fluctuations in the training set in addition to the underlying patterns [30]. As a result, it fits the training data perfectly, but when exposed to new data, its performance deteriorates because it cannot distinguish between real patterns and noise.

For classification tasks, a classification report is a useful tool in machine learning and data analysis. It presents a thorough assessment of a classification model’s effectiveness. Table 6 shows the classification reports of the different ML classifiers used for the prediction of BC.
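A classification report typically lists precision, recall, and F1 score alongside accuracy. The following sketch computes these quantities for a single positive class; the function name and the toy labels are hypothetical and do not correspond to the results in Table 6.

```python
def classification_metrics(y_true, y_pred, positive="M"):
    """Compute precision, recall, F1, and accuracy for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    accuracy = (tp + tn) / len(pairs)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Toy labels: three malignant (M) and five benign (B) samples.
y_true = ["M", "M", "M", "B", "B", "B", "B", "B"]
y_pred = ["M", "M", "B", "B", "B", "B", "M", "B"]
report = classification_metrics(y_true, y_pred)
```

In this toy case one malignant sample is missed and one benign sample is flagged, so precision and recall for the malignant class are both 2/3 while accuracy is 0.75. In a diagnostic setting, recall for the malignant class is often the more informative figure, because a missed malignancy is usually costlier than a false alarm.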

Table 6 Classification results from a variety of machine learning algorithms

Now we will see the highest accuracy score among different ML Classifiers in Table 7.

Table 7 Average machine learning classifier rankings after 10-fold cross-validation

From Table 6 we deduce that the Random Forest classifier outperforms all the other classifiers, achieving 92.55% accuracy.

Hyperparameter tuning

When developing machine learning models, hyperparameter optimisation, also known as hyperparameter tuning, is an essential step. For a machine learning algorithm to perform at its best, suitable values for its hyperparameters must be determined. Hyperparameter tuning frequently uses strategies such as grid search, random search, Bayesian optimisation, or more complex methods such as evolutionary algorithms. The best hyperparameter configuration for a given machine learning problem is often found through trial and error. Although computationally intensive, hyperparameter tuning is an essential part of creating models that perform well and generalise to real-world data.

GridSearchCV proved to be a useful tool for finding well-performing hyperparameters. It implements “fit” and “score” methods, and the parameters of the estimator are tuned by cross-validated grid search over a predetermined parameter grid. GridSearchCV also implements methods such as “predict,” “predict_proba,” “decision_function,” “transform,” and “inverse_transform” if the underlying estimator implements them. The tuning results are shown in Table 8.
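The idea behind a cross-validated grid search can be sketched without any library. The example below is a hand-rolled illustration, not GridSearchCV itself and not the configuration used in the study: it tunes a small K-nearest-neighbours classifier on synthetic data, and the grid values, function names, and parameter names (n_neighbors, p) are assumptions chosen for illustration.

```python
import itertools
import random


def knn_predict(train_X, train_y, x, n_neighbors, p):
    """Majority vote among the n_neighbors closest points (Minkowski order p)."""
    distances = sorted(
        (sum(abs(a - b) ** p for a, b in zip(row, x)), label)
        for row, label in zip(train_X, train_y)
    )
    votes = [label for _, label in distances[:n_neighbors]]
    # Sorting the candidate labels makes tie-breaking deterministic.
    return max(sorted(set(votes)), key=votes.count)


def cv_accuracy(X, y, params, k=5, seed=0):
    """Mean k-fold accuracy of the KNN classifier for one parameter setting."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        train_X = [X[j] for j in train_idx]
        train_y = [y[j] for j in train_idx]
        hits = sum(knn_predict(train_X, train_y, X[j], **params) == y[j] for j in test_idx)
        scores.append(hits / len(test_idx))
    return sum(scores) / k


def grid_search(X, y, param_grid, k=5):
    """Evaluate every combination in param_grid and keep the best mean score."""
    names = sorted(param_grid)
    best_params, best_score = None, -1.0
    for values in itertools.product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = cv_accuracy(X, y, params, k=k)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Synthetic two-class data (not the WDBC data).
rng = random.Random(1)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(30)]
X += [[rng.gauss(3, 1), rng.gauss(3, 1)] for _ in range(30)]
y = ["B"] * 30 + ["M"] * 30

param_grid = {"n_neighbors": [1, 3, 5], "p": [1, 2]}
best_params, best_score = grid_search(X, y, param_grid)
```

Because every combination is scored with cross-validation, the cost grows with the product of the grid sizes and the number of folds, which is why grid search becomes expensive for large grids.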

Table 8 Hyperparameter tuning

From Table 7 we deduce that the model tuned with grid search outperforms all the others, achieving 92.6383% accuracy.

Discussion and conclusion

In the field of research pertaining to breast cancer, deep learning has emerged as a pivotal tool for image segmentation, continuously advancing precision levels. Nevertheless, the focal point lies in optimizing deep learning, a multi-faceted endeavour encompassing several key dimensions. These dimensions encompass refining deep network architectures, employing ensemble learning techniques, fine-tuning hyperparameters through empirical methods, optimizing loss functions in alignment with evaluation metrics, and selecting appropriate optimizers and activation functions.

Using machine learning techniques including KNN, DT, RF, SVC, and Gaussian naive Bayes (GNB), this research aims to create a breast cancer detection model. This model aims to make accurate predictions concerning disease progression and facilitate early diagnosis. For future work, the primary emphasis should be directed towards cause-effect models for disease diagnosis. It is not only imperative to detect the illness but also crucial to analyze the factors exerting the most significant influence on its occurrence. Achieving both objectives is imperative for success. A deeper understanding of the disease’s etiology, coupled with the development of more accurate diagnostic models, holds immense potential in combatting breast cancer and reducing associated complications and fatalities. Addressing data uncertainty through modeling is another critical domain. One of the foremost obstacles to improving previously developed models is the poor quality of epidemiological data related to breast cancer. Lastly, the deployment of autonomous loops for data analysis helps streamline disease control decision-making processes.

While decision trees (DT), KNN, SVC, and GNB all yield favourable results, the random forest (RF) method exhibits superior performance, albeit at the cost of increased computation time. Therefore, it has been determined that the RF-based diagnostic model is the most effective of these machine learning algorithms for detecting breast cancer at an early stage. This conclusion is substantiated by the following considerations.

The formidable challenge in this endeavour primarily stems from the multitude of optimization factors and strategies that necessitated empirical exploration to establish final design specifications. Even with the reduction of trainable parameters in the network to accommodate hardware limitations, substantial CPU power remains a prerequisite for completing training processes. Researchers have found that using deep and machine learning on breast cancer data has led to significant advances in both detection and an understanding of the disease’s complexities [32]. Deep learning, with its capacity for precise image segmentation, has become a crucial tool in this endeavour [33, 34]. However, optimizing these models remains a complex challenge, involving various levels of refinement, from network architectures to hyperparameter tuning.

Our research focused on developing a diagnostic model for BC using a range of ML algorithms. The aim was to enhance early detection and provide accurate predictions of disease progression. To achieve this, we emphasized the importance of causality models in disease diagnosis. It’s not enough to merely detect the disease; we must also identify the key factors influencing its development. This dual approach holds the potential to significantly impact breast cancer outcomes. We have compared the existing ML model accuracy with our models in Table 9.

Table 9 Comparison between existing and proposed model

A deeper understanding of breast cancer’s etiology, coupled with more accurate diagnostic models, can aid in the fight against this disease, reducing complications and fatalities. Furthermore, addressing data uncertainty through modelling is crucial, given the challenges posed by the quality of epidemiological data in this area. While several machine learning algorithms showed promising results, the random forest (RF) method emerged as the most suitable for early-stage breast cancer diagnosis, despite its computational demands.

Further research into personalized treatment recommendations using machine learning can significantly enhance breast cancer treatment plans by tailoring them to individual patient characteristics. Additionally, improving deep learning models for mammogram analysis can lead to better early detection and reduce false positives. Focusing on strategies to enhance the quality of epidemiological data is also crucial for robust machine learning research in breast cancer.