Introduction

Cancer is a deadly disease and a leading cause of death worldwide. According to the World Health Organization (WHO) 2019 report, cancer was ranked as the first or second most common cause of death in 112 out of 183 countries for individuals under the age of 70 [1]. In addition, the International Agency for Research on Cancer (IARC) reported that in 2018 there were 9.6 million cancer deaths and 18.1 million new cases. By 2020, the figures stood at 18.1 million new cases and 9.9 million cancer-related deaths [2,3]. Studies of the primary structure of antimicrobial peptides (AMPs) found that some of them also have anticancer activity4. Such peptides are referred to as anticancer peptides (ACPs). Briefly, an ACP is a short chain of amino acids, usually 5 to 30 residues long5. The key advantage of ACPs over conventional anticancer treatments is that they do not disrupt the body's normal cells. Further benefits include high specificity, ease of synthesis and modification, and low production cost6,7. Despite their significant role, the identification of ACPs poses several challenges. First, identifying ACPs through clinical experimentation is tedious. Second, it takes a great deal of time. Third, the process is very costly. Therefore, conducting manual experimentation on a large scale is nearly impossible, and automating the identification of ACPs via computational methods based on machine learning (ML) algorithms is very important. Recently, researchers have developed many intelligent models to automate the prediction of ACPs. An intelligent model named "ACP" is proposed in8. This model uses an optimized g-gap dipeptide composition to formulate peptide sequences. Similarly, in silico models built on four different datasets were proposed for ACP prediction9. The peptide sequences are formulated using the binary profile (BF) model and split amino acid composition (SAAC).
In addition, a hybrid feature-based predictor was developed10. This model used reduced amino acid composition (RAAC), average chemical shifts (ACS), and amino acid composition (AAC). Its key objective was to achieve higher accuracy using a support vector machine (SVM). "iACP-GAEnsC"11 was proposed for the prediction of ACPs. It used a composite encoding method to collect highly discriminative features from peptide sequences. More recently, an intelligent predictor, "TargetACP", was developed12. This model is based on sequential and evolutionary information and applies the Synthetic Minority Over-sampling Technique (SMOTE) to balance samples in the minority and majority classes. Two datasets were used, and improved performance was achieved. Additionally, a web server was proposed for the prediction and design of ACPs13. A similar model was proposed in14, which used local kernel alignment and PseAAC to predict ACPs. G-gap dipeptide composition was developed for the representation of the sequences15, with maximum relevance-maximum distance (MRMD) used to remove redundant and irrelevant features.

However, all these techniques have limited accuracy and, therefore, need significant improvement. On the other hand, much research has been devoted to traditional anticancer therapies. These include targeted therapy, chemotherapy, radiotherapy, and surgery. However, these methods are insufficient because they also affect normal cells in the body. In addition, they have many drawbacks, such as high cost, hair loss, and adverse reactions. Therefore, researchers have been urged to seek alternative methods with fewer side effects. Thus, the motivation of our research is to develop a highly efficient ACP feature extraction framework.

The motivation behind proposing the EDPC framework stems from the limitations of existing methods like SAAC and PseAAC, which do not fully capture the complexity of peptide sequences. EDPC addresses these gaps by incorporating extended dipeptide patterns and local sequence environments, leading to a richer and more comprehensive feature set. The expected benefits of EDPC include improved accuracy, reduced noise and redundancy, and a more holistic representation of peptide sequences. This enhancement in feature extraction is anticipated to significantly impact anticancer peptide identification, facilitating more reliable and efficient discovery of potent anticancer peptides and advancing the field of peptide-based cancer therapies. The main contributions of this research are as follows.

  • In this research effort, we propose an innovative feature extraction approach, EDPC, to enhance the identification process of anti-cancer peptides.

  • We utilize ensemble learning based on different feature extraction models. We use the Cluster Database at High Identity with Tolerance (CD-HIT) framework to address the problem of noise and redundancy in the data. The EDPC feature extraction method is assessed against two widely used feature extraction methods: SAAC and PseAAC.

  • We compare four distinct ML algorithms, Support Vector Machine (SVM), Decision Tree (DT), Random Forest (RF), and K-Nearest Neighbor (KNN), on the proposed features to assess the accuracy of the proposed model.

  • The performance of EDPC is evaluated against state-of-the-art techniques using several ML algorithms. After a thorough investigation, we found that the proposed EDPC exceeded the accuracy of current state-of-the-art models. The results show that SVM achieves the highest accuracy with the proposed method, i.e., 96.6% and 90.3% on the ACP240 and ACP740 datasets, respectively, even though RF and DT perform better in some circumstances.

  • This research supports the identification of efficient methods for cancer treatment. Our ultimate objective is to expedite the discovery of powerful ACPs, which would open a new direction for cancer treatments.

The rest of the paper is organized as follows: "Literature review" provides an overview of the recent state of the art. "Materials and methods" discusses the proposed model. "Results and discussion" presents the results and discussion. "Conclusion and future work" concludes this study.

Literature review

ACPs are a promising class of cancer therapeutics. The rational design optimizes peptide properties for enhanced anticancer activity by modifying AAC and computational modeling. This approach requires expertise and extensive validation.

Rational design of ACPs

Rational design is an effective strategy for the development of ACPs. It allows for the customization of peptide properties to enhance anticancer activity. This approach involves designing peptide sequences based on their physicochemical properties, structural features, and interactions with target molecules. One standard rational design method is to modify the AAC of existing peptides to improve their activity. For instance, KR-12, a modified peptide of LL-37, is designed by introducing cationic and hydrophobic amino acids into the sequence to enhance its anticancer activity and inhibit tumor growth in vivo by inducing apoptosis in cancer cells14. Another strategy involves developing peptides that target specific molecules involved in cancer development and progression. One example is the peptide iRGD, which is designed to target \(\alpha v\) integrins that are frequently overexpressed on the surface of cancer cells. iRGD was shown to enhance the delivery of drugs and nanoparticles to tumors and, hence, improve therapeutic efficacy16. Computational modeling, such as simulations and docking studies, can be used to study the interactions between peptides and target molecules, as well as to predict the binding affinity of peptides for specific receptors16. ACPs have emerged as a promising new class of cancer therapeutics owing to their high specificity and potency against cancer cells. The design of ACPs is a valuable approach that enables the customization of peptide properties to enhance their anticancer activity17. Rational design builds peptide sequences from their physicochemical properties, structural features, and interactions with target molecules18.

Another standard rational design method is to modify the AAC of existing peptides to improve their activity. For example, KR-12, a modified peptide of LL-37, was designed by introducing cationic and hydrophobic amino acids into the sequence. The objective is to enhance its anticancer activity and inhibit tumor growth in vivo by inducing apoptosis in cancer cells19. In addition, computational modeling is also utilized in rational design to predict the structure and properties of designed peptides. Molecular dynamics simulations and docking studies can be used to study the interactions between peptides and target molecules, as well as predict the binding affinity of peptides for specific receptors20. Computational modeling was used to design the peptide BP100-D421. Rational design is a promising strategy for developing ACPs that allows optimizing peptide properties to improve their anticancer activity and selectivity. However, it requires expertise in protein chemistry and computational modeling. In addition, extensive experimental validation is required to confirm the activity and safety of designed peptides (Supplementary information).

Sequence-based features for ACP prediction

AAC refers to the frequencies of individual amino acids in a peptide sequence. It is the simplest sequence-based feature. This feature has been employed in numerous studies, including the research conducted by Ahmad et al.22, who used AAC to develop an SVM model for ACP prediction. Similarly, Sequeira et al.23 used AAC and other sequence-based features with ML algorithms to build a prediction model for ACPs. Dipeptide composition is another commonly used sequence-based feature that considers the frequency of pairs of amino acids in a peptide sequence. Meher et al.24 developed a prediction model for ACPs using dipeptide composition and SVM. In another study, Lv et al.25 used dipeptide composition and other features to build a prediction model for ACPs using gradient boosting. Wang et al.26 developed a prediction model for ACPs using physicochemical properties and SVM. Sequence-based features are widely used for ACP prediction, and different features can provide complementary information about the properties of peptides relevant to their anticancer activity. ML algorithms such as SVM and RF have been used to develop prediction models for ACPs based on sequence-based features.

AAC is the simplest and most commonly used sequence-based feature. It refers to the frequencies of individual amino acids in a peptide sequence. Several studies have used AAC to predict ACPs. For example, Basith et al.27 developed an SVM model for ACP prediction based on AAC. Similarly, Wei et al.28 used AAC and other sequence-based features to develop a prediction model for ACPs using ML algorithms. Dipeptide composition is another commonly used sequence-based feature that considers the frequency of pairs of amino acids in a peptide sequence. Manavalan et al.29 developed an SVM-based prediction model for ACPs using dipeptide composition. Boopath et al.30 developed an SVM-based prediction model for ACPs using physicochemical properties. While sequence-based features are widely used for ACP prediction and have provided valuable insights into the physicochemical properties of peptides relevant to their activity, they have several limitations. First, they do not consider the three-dimensional structure of peptides, which is important for their activity. Second, they do not provide information about the interaction between peptides and target molecules, which is critical for their selectivity. Third, they cannot capture the complex relationship between peptide structure and activity. The development of more accurate prediction models will therefore require the integration of sequence-based features with other types of features, such as structural and interaction-based features.

Machine learning

SVM is one of the most commonly used ML algorithms for prediction31,32. SVMs are supervised learning algorithms with widespread applications in bioinformatics and many other fields. Their function is to identify the optimal separating hyperplane. Ahmad et al.11 developed an SVM-based prediction model for ACPs using feature selection. The study used a dataset comprising 298 ACPs and 298 non-ACPs to train and test the SVM model. The authors used AAC and dipeptide composition as features and obtained an accuracy of 94.9%, indicating that the SVM model is effective in predicting ACPs. Deep learning has also been used for ACP prediction. Gulam et al.33 developed a prediction model for ACPs using a convolutional neural network (CNN). The authors used a dataset of 902 ACPs and 902 non-ACPs, took the amino acid sequence as input, and trained the CNN to predict whether a peptide is an ACP. Computational prediction offers a rapid and efficient way to identify potential therapeutic agents for cancer treatment. Therefore, we review some of the recent studies that have employed ML algorithms for ACP prediction. Zhang et al.34 presented an SVM-based prediction technique, XGB-RFE. The results showed that the SVM model achieved a high accuracy of 91.5% and an AUC of 0.87, indicating that it is effective in predicting ACPs. Similarly, Wu et al.35 presented a novel tool called ACP-MCAM. Zhao et al.36 proposed a prediction model for ACPs using RF and an auto-correlation approach. The authors used a dataset of 1,265 ACPs and 1,265 non-ACPs to train and test the RF model, with a hybrid feature set combining dipeptide composition and auto-correlation features. The RF model achieved an accuracy of 87.9% and an AUC of 0.84, indicating that it is effective in predicting ACPs. ML algorithms such as RF and SVM have thus shown promising results in predicting ACPs.

In recent years, many AI algorithms and their applications, such as deep learning, have been developed37,38. Table 1 summarizes the existing state-of-the-art techniques for ACPs. Rational design of ACPs involves designing peptides based on specific physicochemical properties, while sequence-based ACP prediction analyzes the sequence features of known ACPs to predict novel ACPs. ML-based ACP prediction, in turn, builds prediction models that analyze the sequence and/or physicochemical features.

Table 1 Summary of the existing techniques for ACPs.

Materials and methods

In this research, the proposed feature extraction method consists of five phases, as shown in Fig. 1. In the first phase, the dataset is collected and pre-processed using the CD-HIT and iLearn techniques to remove redundancy and noise. In the second phase, features are extracted using SAAC, PseAAC, and the proposed EDPC. The data is pre-processed in the third and fourth phases to improve the dataset's quality. Finally, in the fifth phase, ML techniques, i.e., DT, SVM, RF, and KNN, are applied to classify the data.

Figure 1 Flowchart of the proposed model.

Figure 1 presents the proposed model. The dataset was noisy and, therefore, required comprehensive pre-processing. Redundancy and noise are removed in the first stage using the CD-HIT and iLearn techniques. Once the dataset was refined, the features were extracted using the SAAC, PseAAC, and EDPC methods. Finally, the ML techniques SVM, DT, RF, and KNN were applied to predict ACPs.

Data set

In this study, we used two datasets, ACP240 [39] and ACP740 [40]. These datasets are commonly used in the research community for anticancer peptide prediction41,42, and this section describes them in detail. ACP240 is a two-class dataset that contains anticancer and non-anticancer peptides. This dataset has 1550 instances, 775 of which belong to the negative class and 775 to the positive class39. The ACP740 dataset comprises 740 samples, of which 376 are positive and 364 are negative40. These datasets were chosen for their comprehensive collection of anticancer and non-anticancer peptide sequences, offering a diverse range of samples for analysis. The raw data are not refined, and comprehensive preprocessing is required to clean them. Ensuring that each class is equally represented in the training and testing sets is essential.

We used the iLearn Plus web server to pre-process the dataset. It accepts sequences in FASTA format and is designed to take at most 2000 sequences at a time. iLearn is an ML platform that provides web-based and graphical interfaces, offers a wide range of algorithms, and automates sequence-based feature extraction. CD-HIT (Cluster Database at High Identity with Tolerance) is an open-source tool developed by Fu et al.43. The idea is to reduce the size of the dataset without losing sequence information by removing redundant sequences. It takes sequences in FASTA format as input and outputs a non-redundant set of sequences.
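As a rough illustration of this cleaning step, the following minimal Python sketch parses FASTA text, drops sequences containing non-standard residues, and removes exact duplicates. It is not CD-HIT, which clusters sequences by an identity threshold. The function names and the choice to treat non-standard residues as noise are assumptions made for illustration only.

```python
from typing import List, Tuple

# The 20 standard amino acids; anything else is treated as noise (assumption).
STANDARD = set("ACDEFGHIKLMNPQRSTVWY")


def parse_fasta(text: str) -> List[Tuple[str, str]]:
    """Parse FASTA text into (header, sequence) pairs."""
    records: List[Tuple[str, str]] = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks).upper()))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks).upper()))
    return records


def drop_noise_and_duplicates(records: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Keep the first copy of each valid sequence (exact-match redundancy only)."""
    seen = set()
    kept = []
    for header, seq in records:
        if not seq or set(seq) - STANDARD:
            continue  # empty or contains a non-standard residue
        if seq in seen:
            continue  # exact duplicate of an earlier sequence
        seen.add(seq)
        kept.append((header, seq))
    return kept
```

Exact-match filtering is only the simplest case of redundancy; removing near-identical sequences requires an identity-based clustering tool such as CD-HIT itself.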

Feature extraction

Extraction of relevant attributes from primary sequences is a critical task in the development of a computational predictor for the identification of ACPs. To address this task, we use SAAC, PseAAC, and the proposed EDPC.

Split Amino Acid Composition (SAAC)

SAAC is used for feature representation and is also helpful in overcoming prediction problems44. In this technique, the peptide sequence is split into separate parts, and the amino acid composition of each part is calculated independently. In SAAC, the peptide sequence is divided into three parts: the N-terminus, the C-terminus, and the region between these two termini. It is represented in Eq. (1).

$$P=[{f}_{1}^{N},\ldots ,{f}_{20}^{N},{f}_{1}^{int},\ldots ,{f}_{20}^{int},{f}_{1}^{C},\ldots ,{f}_{20}^{C}]$$
(1)

The equations below are used to generate the SAAC feature vector.

$$f(i)=\frac{NA({A}_{i})}{{X}_{n}},\quad i=0,1,\ldots ,19$$
(2)
$$f(i)=\frac{NA({A}_{i})}{M-{X}_{n}-{X}_{c}},\quad i=20,21,\ldots ,39$$
(3)
$$f(i)=\frac{NA({A}_{i})}{{X}_{c}},\quad i=40,41,\ldots ,59$$
(4)

where \({A}_{i}\) is the amino acid residue type indexed by \(i\) modulo 20, \(NA({A}_{i})\) is the number of occurrences of \({A}_{i}\) in the corresponding split, \(M\) is the length of the peptide sequence, \({X}_{n}\) is the number of residues in the N-terminal split, \({X}_{c}\) is the number of residues in the C-terminal split, and \(f(i)\) is the ith element of the SAAC feature vector. Each element is one of the 20 amino acid frequencies of its segment.
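The SAAC computation can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the terminal lengths `n_len` and `c_len` are hypothetical defaults (the text does not specify them), and the three 20-D blocks are concatenated in N-terminal, middle, C-terminal order.

```python
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(segment: str) -> List[float]:
    """Frequency of each of the 20 amino acids within one segment."""
    if not segment:
        return [0.0] * len(AMINO_ACIDS)
    return [segment.count(a) / len(segment) for a in AMINO_ACIDS]


def saac(sequence: str, n_len: int = 5, c_len: int = 5) -> List[float]:
    """60-D split amino acid composition: N-terminal, middle, C-terminal."""
    if len(sequence) <= n_len + c_len:
        raise ValueError("sequence too short for the chosen terminal lengths")
    n_part = sequence[:n_len]
    c_part = sequence[len(sequence) - c_len:]
    middle = sequence[n_len:len(sequence) - c_len]
    return composition(n_part) + composition(middle) + composition(c_part)
```

Because ACPs can be as short as five residues, the terminal lengths would need to be chosen with the shortest peptide in the dataset in mind.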

Pseudo Amino Acid Composition (PseAAC)

PseAAC is used to obtain discrete numerical features from peptide sequences45. It was introduced to overcome the limitations of AAC, in particular the lack of sequence-order information and correlation factors. Chou introduced it in 2001, and it has since been used widely in computational biology, drug discovery, and biomedicine, for tasks such as protein attribute prediction, antifreeze protein identification, and mitochondrial localization. PseAAC can be represented as:

$$P={[{f}_{1},\ldots ,{f}_{20},{f}_{20+1},\ldots ,{f}_{20+\lambda }]}^{T}$$
(5)

where \(P\) signifies PseAAC, \(T\) represents transposition, and \({f}_{1},\ldots ,{f}_{20}\) represent the fractions of the 20 unique amino acids.
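A simplified PseAAC computation can be sketched as follows. This is an illustrative sketch, not the exact configuration used in this study: it assumes a single physicochemical property (the Kyte-Doolittle hydropathy scale, standardized over the 20 amino acids) rather than several, and the values of `lam` (the number of sequence-order terms, \(\lambda\)) and the weight `w` are hypothetical defaults.

```python
import math
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy values (single-property simplification, assumption).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5,
    "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8,
    "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
_mean = sum(KD.values()) / len(KD)
_std = math.sqrt(sum((v - _mean) ** 2 for v in KD.values()) / len(KD))
H = {a: (v - _mean) / _std for a, v in KD.items()}  # standardized property


def pseaac(sequence: str, lam: int = 3, w: float = 0.05) -> List[float]:
    """(20 + lam)-D pseudo amino acid composition with one property."""
    n = len(sequence)
    if n <= lam:
        raise ValueError("sequence must be longer than lam")
    freqs = [sequence.count(a) / n for a in AMINO_ACIDS]
    thetas = []
    for k in range(1, lam + 1):
        # k-th tier sequence-order correlation factor
        total = sum((H[sequence[i]] - H[sequence[i + k]]) ** 2 for i in range(n - k))
        thetas.append(total / (n - k))
    denom = sum(freqs) + w * sum(thetas)
    return [f / denom for f in freqs] + [w * t / denom for t in thetas]
```

The first 20 components carry the composition and the last `lam` components carry the sequence-order information, so the vector sums to one by construction.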

Proposed extended dipeptide composition

The proposed EDPC is a novel feature extraction method developed as an extension of the Dipeptide Composition (DPC) technique. EDPC is based on discrete peptide sequences and uses neighboring amino acids to gather features. First, each peptide sequence is broken down into its constituent dipeptides (pairs of amino acids), which form the foundation of the feature extraction process. EDPC then extends the analysis of each dipeptide by considering surrounding amino acids up to a specified distance. This extension captures the local sequence environment, which is often critical in determining the biological activity of peptides. The algorithm quantifies the presence and frequency of these extended dipeptide patterns within each peptide sequence, transforming qualitative sequence information into a quantitative feature set. It obtains a 400-D feature vector by computing the frequency of each pair of adjacent amino acid residues. One of its main advantages is that EDPC also incorporates global information about each peptide, whereas other feature extraction techniques, such as AAC, only compute the frequency with which each amino acid occurs. This global information can be valuable in identifying anticancer peptides. To calculate EDPC, first split the peptide sequence into dipeptides and count the occurrence frequency of each dipeptide in the peptide. Then, compute the probability of observing each dipeptide using the following equations:

$$EDPC(x)=\frac{D{P}_{r}^{\prime}}{D{P}_{T}^{\prime}}$$
(6)
$${P}_{r}^{\prime}=\frac{{P}_{r}-\widehat{P}}{\delta }$$
(7)
$$AC(d)=\sum_{i=1}^{N-d}{P}_{i}{P}_{i+d}$$
(8)
$$ATS(d)=\frac{AC(d)}{N-d}$$
(9)
$$I(d)=\frac{\frac{1}{N-d}\sum_{i=1}^{N-d}({P}_{i}-\widehat{P})({P}_{i+d}-\widehat{P})}{\frac{1}{N}\sum_{i=1}^{N}{({P}_{i}-\widehat{P})}^{2}}$$
(10)

where EDPC(x) is the occurrence frequency of dipeptide x in the peptide sequence, \(D{P}_{r}^{\prime}\) is the count of a single dipeptide out of the 400 possible dipeptides, and \(D{P}_{T}^{\prime}\) is the total number of dipeptides in the sequence. \({P}_{i}\) is the property value at position \(i\), \(\widehat{P}\) and \(\delta \) are the mean and standard deviation used for normalization, \(N\) is the sequence length, and \(d\) is the distance (lag) between the two residues. Algorithm 1 presents the proposed EDPC.
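The dipeptide frequency of Eq. (6) can be sketched in Python as below. This is a minimal sketch, not a reproduction of Algorithm 1: the `gap` parameter is a hypothetical way to pair each residue with a neighbor further along the sequence, illustrating what "surrounding amino acids up to a specified distance" might look like. How the proposed method combines such patterns into its final 400-D vector is not specified here.

```python
from itertools import product
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered pairs: AA, AC, AD, ..., YY
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def dipeptide_composition(sequence: str, gap: int = 0) -> List[float]:
    """400-D frequency vector of residue pairs separated by `gap` residues.

    gap=0 gives the standard dipeptide composition (adjacent residues).
    """
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    step = gap + 1
    for i in range(len(sequence) - step):
        pair = sequence[i] + sequence[i + step]
        if pair in counts:  # skip pairs with non-standard residues
            counts[pair] += 1
            total += 1
    if total == 0:
        return [0.0] * len(DIPEPTIDES)
    return [counts[d] / total for d in DIPEPTIDES]
```

For example, `dipeptide_composition("ACAC")` counts the pairs AC, CA, and AC, so the AC entry is 2/3 and the CA entry is 1/3.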

Algorithm 1 Proposed EDPC.

Classification techniques

A model is trained in classification using sample data instances where class labels are known in advance. Although many classification algorithms exist, SVM, RF, DT, and KNN are the most commonly used in the literature with significant results.

Support vector machine (SVM)

Support Vector Machine (SVM) is a powerful supervised learning method first developed by Vapnik46. In binary classification problems, SVM maps the data into a high-dimensional feature space and computes the optimal separating hyperplane. This hyperplane has the largest margin from its support vectors, which reduces the error rate on test samples. SVM supports various kernel functions: linear, polynomial, radial basis function (RBF), and sigmoid. This article used the RBF kernel to obtain the best classification hyperplane. The kernel width parameter \(\gamma \) and the regularization parameter \(C\) are determined using the grid search method. The SVM decision function is defined as:

$$f\left(x\right)=sign(w\cdot x+b)$$
(11)

where \(f(x)\) represents the predicted class label for an input sample \(x\), \(w\) is the weight vector, and \(b\) is the bias term. The \(sign( )\) function returns the sign of the linear function \(w\cdot x+b\).
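With an RBF kernel, the linear product is replaced by a kernel evaluation against the support vectors. The sketch below uses the standard RBF form \(K(x,y)=\exp(-\gamma {\Vert x-y\Vert }^{2})\) and the kernelized decision rule. The support vectors, coefficients, and bias in the usage example are hypothetical values for illustration, not learned parameters from this study.

```python
import math
from typing import Sequence


def rbf_kernel(x: Sequence[float], y: Sequence[float], gamma: float) -> float:
    """RBF kernel: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def svm_decision(
    x: Sequence[float],
    support_vectors: Sequence[Sequence[float]],
    dual_coefs: Sequence[float],  # alpha_i * y_i for each support vector
    bias: float,
    gamma: float,
) -> int:
    """Sign of the kernelized decision function; returns +1 or -1."""
    score = sum(c * rbf_kernel(sv, x, gamma) for sv, c in zip(support_vectors, dual_coefs))
    return 1 if score + bias >= 0 else -1


# Hypothetical two-support-vector model: one positive near the origin,
# one negative near (2, 2).
label = svm_decision([0.1, 0.0], [[0.0, 0.0], [2.0, 2.0]], [1.0, -1.0], 0.0, gamma=1.0)
```

A larger \(\gamma \) makes each support vector's influence more local, which is why \(\gamma \) and \(C\) are usually tuned together.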

Decision tree

A DT is one of the most widely used supervised learning algorithms. It can be used for regression as well as classification problems, although it is most often applied to classification. A DT has a tree-like structure consisting of a root node, branches, and leaf nodes. Splits can be chosen using the entropy of the class distribution, where \({p}_{i}\) is the relative frequency of class \(i\) and \(c\) is the number of classes:

$$Entropy=\sum_{i=1}^{c}-{p}_{i}\times {log}_{2}({p}_{i})$$
(12)

Random forest (RF)

RF is a classification algorithm that is broadly used in bioinformatics. It is applied in ML to both regression and classification problems. An RF consists of multiple DTs whose predictions are aggregated. For regression, the error can be measured by the mean squared error:

$$MSE=\frac{1}{N}\sum_{i=1}^{N}{({f}_{i}-{y}_{i})}^{2}$$
(13)

where \(N\) denotes the number of data points, \({f}_{i}\) denotes the model's output, and \({y}_{i}\) denotes the actual value of the data at point \(i\).

$$Gini=1-\sum_{i=1}^{c}{({p}_{i})}^{2}$$
(14)

where \({p}_{i}\) is the relative frequency of class \(i\) and \(c\) is the number of classes.

$$Entropy=\sum_{i=1}^{c}-{p}_{i}\times {log}_{2}({p}_{i})$$
(15)

Entropy measures the impurity of a node from the class probabilities \({p}_{i}\); it is zero when all samples in the node belong to a single class.
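The three quantities used above can be computed directly from labels and predictions. The following short Python sketch is illustrative only; the function names are chosen for this example.

```python
import math
from collections import Counter
from typing import Hashable, List, Sequence


def class_probabilities(labels: Sequence[Hashable]) -> List[float]:
    """Relative frequency of each class present in `labels`."""
    n = len(labels)
    return [count / n for count in Counter(labels).values()]


def entropy(labels: Sequence[Hashable]) -> float:
    """Shannon entropy (base 2) of the class distribution."""
    return sum(-p * math.log2(p) for p in class_probabilities(labels))


def gini(labels: Sequence[Hashable]) -> float:
    """Gini impurity: 1 minus the sum of squared class probabilities."""
    return 1.0 - sum(p ** 2 for p in class_probabilities(labels))


def mse(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Mean squared error between model outputs and true values."""
    return sum((f - y) ** 2 for f, y in zip(predicted, actual)) / len(actual)
```

A perfectly balanced two-class node has entropy 1 and Gini impurity 0.5, while a pure node scores 0 on both.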

K-nearest neighbor (KNN)

KNN is a well-known classification algorithm used in data mining and bioinformatics for prediction purposes. It is an instance-based ML algorithm, which is why it is also known as a lazy learning method. KNN assigns a data sample to the class that is most common among its nearest neighbors. Euclidean distance is used to measure the distance between instances in the feature space. It can be computed as:

$$D(x,y)=\sqrt{\sum_{i=1}^{n}{({x}_{i}-{y}_{i})}^{2}}$$
(16)

The distances are then sorted in ascending order, \({d}_{i}\le {d}_{i+1}\), where \(i=1, 2, 3, \ldots , N\).
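The KNN procedure can be sketched as follows: compute the Euclidean distance from the query to every training sample, sort, and take a majority vote among the `k` closest. The default `k=3` and the toy data in the usage note are assumptions for illustration.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def euclidean(x: Sequence[float], y: Sequence[float]) -> float:
    """Square root of the summed squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def knn_predict(
    train_x: Sequence[Sequence[float]],
    train_y: Sequence[Hashable],
    query: Sequence[float],
    k: int = 3,
) -> Hashable:
    """Majority class among the k training samples nearest to `query`."""
    neighbors = sorted(zip(train_x, train_y), key=lambda pair: euclidean(pair[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]
```

With training points `[[0, 0], [0, 1], [5, 5], [6, 5]]` labeled `[0, 0, 1, 1]`, a query at `[0.2, 0.1]` has two class-0 neighbors among its three nearest and is assigned class 0.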

Performance evaluation

Different metrics are used in ML to assess the quality of a model's predictions. These metrics are based on the counts of true and false predictions, which are stored in a confusion matrix. Accuracy is the most commonly used measure of prediction performance. However, accuracy alone is insufficient to assess a predictor. Therefore, several performance metrics are used: Accuracy, Sensitivity, Specificity, Precision, Recall, and F1-score. They are defined as:

$$Acc=\frac{TN+TP}{TN+TP+FN+FP}$$
(17)
$$Sens=\frac{TP}{TP+FN}$$
(18)
$$Spec=\frac{TN}{TN+FP}$$
(19)
$$Prec=\frac{TP}{TP+FP}$$
(20)
$$F1\text{-}Score=2\times \frac{Precision\times Recall}{Precision+Recall}$$
(21)
$$Recall=\frac{TP}{TP+FN}$$
(22)

In the above equations, TN, TP, FN, and FP represent True Negative, True Positive, False Negative, and False Positive, respectively.
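These metrics follow directly from the four confusion-matrix counts. The sketch below is a minimal illustration; the function name and the choice to return 0.0 when a denominator is zero are assumptions.

```python
from typing import Dict


def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Accuracy, sensitivity (recall), specificity, precision, and F1-score."""

    def ratio(num: float, den: float) -> float:
        return num / den if den else 0.0  # guard against empty classes

    sensitivity = ratio(tp, tp + fn)  # identical to recall
    precision = ratio(tp, tp + fp)
    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": sensitivity,
        "specificity": ratio(tn, tn + fp),
        "precision": precision,
        "f1": ratio(2 * precision * sensitivity, precision + sensitivity),
    }
```

For example, with TP = 40, TN = 45, FP = 5, and FN = 10, accuracy is 0.85, sensitivity is 0.80, and specificity is 0.90.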

Results and discussion

We used Python for experimentation, with Google Colab and Jupyter Notebook for implementing the code. The dataset is divided into two parts: 70% is used to train the predictor, and the remaining 30% is used to test it.
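A 70/30 split that keeps each class proportionally represented in both parts can be sketched as follows. This is an illustrative sketch, not the exact procedure used in the experiments: the function name, the fixed `seed`, and the per-class rounding are assumptions.

```python
import random
from collections import defaultdict
from typing import Hashable, List, Sequence, Tuple


def stratified_split(
    samples: Sequence,
    labels: Sequence[Hashable],
    test_ratio: float = 0.3,
    seed: int = 42,
) -> Tuple[List, List, List, List]:
    """Split into train/test so each class keeps roughly the same proportion."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_ratio)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])

    train_x = [samples[i] for i in train_idx]
    test_x = [samples[i] for i in test_idx]
    train_y = [labels[i] for i in train_idx]
    test_y = [labels[i] for i in test_idx]
    return train_x, test_x, train_y, test_y
```

Fixing the seed makes the split reproducible, so different feature extraction methods can be compared on the same training and test peptides.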

Experimental environment

This sub-section describes the experimental environment. The scikit-learn library is used with its default parameters for all classifiers. The experiments were run in a Python 3 environment on a system with 8 GB of memory and a 3.3 GHz processor.

Split amino acid composition (SAAC)

With SAAC, four classifiers, i.e., SVM, DT, RF, and KNN, were evaluated on two datasets, i.e., ACP240 and ACP740, using four performance metrics (Accuracy, Precision, Recall, and F1-score), as shown in Table 2.

Table 2 SAAC feature extraction framework on the ACP240 and ACP740.

The RF classifier achieved the highest accuracy of 0.90 on the ACP240 dataset, followed by SVM with 0.86. On the ACP740 dataset, the RF classifier achieved the highest accuracy of 0.83, with SVM reaching a comparable 0.83. Overall, the RF and SVM classifiers performed better than the DT and KNN classifiers on both datasets. Figure 2 shows a detailed analysis of the SAAC feature extraction technique.

Figure 2 SAAC feature extraction framework with different performance parameters.

Pseudo amino acid composition (PseAAC)

With PseAAC, the same four classifiers, i.e., SVM, DT, RF, and KNN, were evaluated on the ACP240 and ACP740 datasets using five performance metrics (Accuracy, Sensitivity, Specificity, Precision, and F1-score), as shown in Table 3.

Table 3 PseAAC Feature extraction framework on the ACP240 and ACP740.

The SVM classifier performed best on ACP740 and RF performed best on ACP240, with 91% and 84% accuracy, respectively. DT and KNN had lower accuracies on both datasets.

Overall, the PseAAC feature extraction framework appears to be a promising method for classifying peptides in the ACP240 and ACP740 datasets, as shown in Fig. 3.

Figure 3 PseAAC feature extraction framework with different performance parameters.

Proposed extended dipeptide composition (EDPC)

The proposed EDPC is used to evaluate the results of four classifiers, i.e., SVM, DT, RF, and KNN, on two datasets, i.e., ACP240 and ACP740, and based on Accuracy, Precision, Recall, and F1-score.

Table 4 shows higher performance for all classifiers on both datasets. Specifically, the SVM classifier achieved 96.6% and 90.3% accuracy on ACP240 and ACP740 datasets, respectively.

Table 4 Proposed EDPC feature extraction framework on the ACP240 and ACP740.

Figure 4 shows how the performance of the different classifiers changes as the dataset size increases. KNN and SVM classifiers showed improved accuracy with larger datasets, whereas the RF classifier showed a decrease in accuracy. This indicates that RF may not perform as well when dealing with larger datasets.

Figure 4 Proposed EDPC feature extraction framework with different performance parameters.

The EDPC framework outperforms SAAC and PseAAC due to its comprehensive feature representation, effective noise and redundancy reduction, holistic view of peptide sequences, and robustness across various ML algorithms. Unlike SAAC and PseAAC, EDPC captures extended dipeptide patterns along with the local sequence environment, providing richer and more detailed information. The CD-HIT framework further enhances EDPC by effectively reducing noise and redundant features, resulting in a cleaner and more informative feature set. Additionally, EDPC combines local and global sequence information, offering a holistic view of peptide sequences critical for accurate anticancer peptide identification. This comprehensive approach, coupled with the robustness and adaptability of EDPC across different ML algorithms, ensures superior performance and reliability.

Comparison with state-of-the-art techniques

Our proposed EDPC was compared with state-of-the-art techniques, i.e., XGB-RFE34 and ENACP36, on the ACP240 and ACP740 datasets. The independent test to compare these models was carried out by applying each model to the same datasets, using a consistent evaluation methodology. Table 5 shows the comparison based on performance metrics such as accuracy, precision, recall, and F1-score.

Table 5 Comparison of proposed EDPC with state-of-the-art techniques.

For the ACP240 dataset, the EDPC framework achieves an accuracy of 0.966, higher than the accuracies achieved by XGB-RFE (0.85) and ENACP (0.87). Regarding precision, recall, and F1 score, the EDPC framework outperforms XGB-RFE and is comparable to ENACP.

For the ACP740 dataset, the EDPC framework achieves an accuracy of 0.948, which is higher than XGB-RFE (0.89) and ENACP (0.84). In terms of recall, precision, and F1-score, the EDPC framework likewise outperforms XGB-RFE and ENACP.

Overall, the results suggest that the proposed EDPC feature extraction framework is a promising method for predicting ACPs on both datasets, as shown in Fig. 5.

Figure 5

EDPC Comparison with State-of-the-Art Approaches.

The proposed EDPC feature extraction framework outperforms XGB-RFE and ENACP in accuracy on both datasets and is at least comparable in recall, precision, and F1-score. We attribute this performance to the improved feature extraction provided by EDPC and to more effective dataset handling.

The main conclusion of the paper is that the proposed EDPC method, as a novel feature extraction technique, significantly enhances the performance of traditional ML algorithms in identifying anticancer peptides. The study demonstrates that EDPC, when integrated with established algorithms such as SVM, DT, RF, and KNN, outperforms existing feature extraction methods such as SAAC and PseAAC. The key finding is that EDPC provides a more detailed and accurate representation of peptide sequences, improving classification accuracy and effectiveness in identifying potential anticancer peptides. This advancement holds promise for the development of new peptide-based cancer therapies.

Statistical significance test

We performed a statistical test on all measures to check whether these improvements were significant or due to random chance. The test reports that all the results are significant at the 95% confidence level. The only exception is the random forest results, which are significant at the 90% confidence level. The significance test results are presented in Table 6.
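The specific test is not restated in this section, so the following is only a sketch of one way such a check could be carried out: an exact paired sign-flip permutation test on per-fold scores of two models. The function name `paired_sign_flip_pvalue` and the per-fold accuracies are hypothetical, are not results from this study, and are not necessarily the test used for Table 6.

```python
from itertools import product


def paired_sign_flip_pvalue(scores_a, scores_b):
    """Exact two-sided p-value for the mean paired difference.

    Under the null hypothesis the sign of each paired difference is
    arbitrary, so every one of the 2^n sign patterns is enumerated.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    observed = abs(sum(diffs)) / n
    extreme = 0
    total = 0
    for signs in product((1, -1), repeat=n):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, diffs))) / n
        # Small tolerance so floating-point ties count as "as extreme".
        if flipped >= observed - 1e-12:
            extreme += 1
    return extreme / total


# Hypothetical 10-fold accuracies for two models (illustration only).
model_a = [0.96, 0.97, 0.95, 0.98, 0.96, 0.97, 0.95, 0.96, 0.97, 0.96]
model_b = [0.86, 0.88, 0.85, 0.87, 0.86, 0.88, 0.84, 0.87, 0.86, 0.85]
p_value = paired_sign_flip_pvalue(model_a, model_b)
print(f"p = {p_value:.4f}; significant at the 95% level: {p_value < 0.05}")
```

A p-value below 0.05 corresponds to significance at the 95% confidence level, and a p-value below 0.10 to the 90% level mentioned above. The enumeration is exact but grows as 2^n, so it is only practical for a small number of folds.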

Table 6 Statistical Analysis of Model Performance on ACP240 and ACP740 Datasets.

Conclusion and future work

This study proposed a novel feature extraction framework for predicting ACPs, named EDPC. The Cluster Database at High Identity with Tolerance (CD-HIT) technique was applied to remove noise and redundancy. We implemented the proposed framework with standard ML classifiers, achieving slightly improved accuracy compared to the state-of-the-art models. The study highlights the importance of both feature extraction and classifier choice in achieving high accuracy in classification tasks. Our study of the EDPC method involved independent validation using the ACP240 and ACP740 datasets, ensuring no overlap between training and validation data. The models, trained with algorithms such as SVM, DT, RF, and KNN, were tested on unseen data within these datasets. Performance was assessed using metrics such as accuracy, sensitivity, and specificity.

However, the study is limited by its reliance on these specific datasets, which may not encompass the full diversity of anticancer peptides. Additionally, while EDPC's effectiveness with the employed ML models was demonstrated, its performance with other advanced models, particularly deep learning algorithms, still needs to be explored. Another limitation is the EDPC method's focus on sequence-based features, potentially overlooking other biologically relevant data such as three-dimensional structural information. The generalizability of our findings is also constrained by the dataset-specific nature of the study, necessitating further validation across a broader range of datasets. Lastly, the computational resources required for EDPC might exceed those for simpler methods, which could be a consideration in some applications. Despite these limitations, our study provides valuable insights into the use of ML for anticancer peptide prediction, setting a foundation for further research to enhance and expand upon these findings.

Future research can proceed in several directions. First, multiple feature extraction techniques can be integrated to capture more diverse information from peptide sequences; combining different methods yields more comprehensive representations of the input data and may improve ACP prediction accuracy. Second, deep learning approaches such as Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and transformers can be investigated to enhance the predictive capabilities of ACP models. Third, alternative pre-processing techniques can be studied to improve dataset refinement, including novel approaches for handling noise, missing data, and class imbalance, resulting in cleaner and more balanced datasets and, in turn, more robust and reliable models. Our future work will extend the evaluation to a larger variety of datasets, incorporate tertiary structural information, and use deep learning techniques to improve the proposed EDPC.