Extended dipeptide composition framework for accurate identification of anticancer peptides

Ullah, Faizan; Salam, Abdu; Nadeem, Muhammad; Amin, Farhan; AlSalman, Hussain; Abrar, Mohammad; Alfakih, Taha

doi:10.1038/s41598-024-68475-8

Article
Open access
Published: 29 July 2024

Volume 14, article number 17381, (2024)

Scientific Reports

Faizan Ullah¹,
Abdu Salam²,
Muhammad Nadeem³,
Farhan Amin⁴,
Hussain AlSalman⁵,
Mohammad Abrar⁶ &
Taha Alfakih⁷

Abstract

The identification of anticancer peptides (ACPs) is crucial, especially in the development of peptide-based cancer therapy. Classical feature models such as Split Amino Acid Composition (SAAC) and Pseudo Amino Acid Composition (PseAAC) offer a limited feature representation. To improve the predictive accuracy and efficiency of ACP identification, this research proposes an advanced feature extraction framework, the Extended Dipeptide Composition (EDPC). The proposed EDPC framework extends dipeptide composition by considering local sequence environment information and applies the CD-HIT framework to remove noise and redundancy. To measure accuracy, we performed several experiments using four well-known machine learning (ML) algorithms: Support Vector Machine (SVM), Decision Tree (DT), Random Forest (RF), and K-Nearest Neighbor (KNN). For comparison, we used accuracy, specificity, sensitivity, precision, recall, and F1-score as evaluation criteria. The reliability of the proposed framework is further evaluated using statistical significance tests. The proposed EDPC framework outperformed SAAC and PseAAC: the SVM model delivered the highest accuracy of 96.6%, with significant improvements in specificity, sensitivity, precision, and F1-score over multiple datasets. EDPC achieves higher classification performance because its enhanced feature representation incorporates both local and global sequence profiles. The proposed framework can handle noisy and duplicated features while supporting a wide range of feature representations. Finally, the proposed framework can be used in clinical applications where ACP identification is essential. Future work will include extending to a larger variety of datasets, incorporating tertiary structural information, and using deep learning techniques to improve the proposed EDPC.

Introduction

Cancer is a deadly disease and a leading global cause of death. According to the World Health Organization (WHO) 2019 report, cancer was ranked as the first or second most common cause of death in 112 out of 183 countries for individuals under the age of 70¹. In addition, research by the International Agency for Research on Cancer (IARC) indicated that in 2018 there were 18.1 million new cases and 9.6 million cancer deaths. In 2020, these figures stood at 18.1 million new cases and 9.9 million cancer-related fatalities^2,3. Studies of the primary structure of antimicrobial peptides (AMPs) found that they also have anticancer activities⁴. AMPs with such activity are termed anticancer peptides (ACPs). An ACP is a short chain of amino acids, usually 5 to 30 residues long⁵. The key advantage of ACPs over conventional anticancer treatments is that they do not disrupt the body's normal cells. Further benefits include high specificity, ease of synthesis and modification, and low production cost^6,7. Despite their significant role, identifying ACPs poses several challenges. First, identifying ACPs through clinical experimentation is tedious. Second, it takes a large amount of time. Third, the process is very costly. Therefore, conducting manual experimentation on a large scale is nearly impossible, and automating the identification of ACPs via computational methods based on ML algorithms is very important. Recently, researchers have developed many intelligent models to automate the prediction of ACPs. An intelligent model named "ACP" is proposed in⁸; it uses an optimized g-gap dipeptide composition to formulate peptide sequences. Similarly, in silico models were proposed that comprised four different datasets for ACP prediction⁹; the peptide sequences are formulated using the binary profile (BF) model and SAAC.
In addition, a hybrid feature-based predictor was developed for feature extraction¹⁰. This model used reduced amino acid composition (RAAC), average chemical shifts (ACS), and amino acid composition (AAC), with the key objective of achieving higher accuracy using SVM. "iACP-GAEnsC"¹¹ was proposed for the prediction of ACPs; it uses a composite encoding method to collect highly discriminative features from peptide sequences. More recently, an intelligent predictor, "TargetACP", was developed¹². This model is based on sequential and evolutionary information and applies the Synthetic Minority Over-sampling Technique (SMOTE) to balance samples in the minority and majority classes. Two datasets were used, and improved performance was achieved. Additionally, a web server was proposed for the prediction and design of ACPs¹³. A similar model proposed in¹⁴ used local kernel alignment and PseAAC to predict ACPs. A g-gap dipeptide composition was developed for the representation of sequences¹⁵, with maximum relevance-maximum distance (MRMD) used to remove redundant and irrelevant features.

However, all these techniques have limited accuracy and, therefore, need significant improvement. On the other hand, much research has been performed on traditional anticancer treatments, including targeted therapy, chemotherapy, radiotherapy, and surgery. However, these methods are insufficient because they also affect normal cells in the body. In addition, they have many drawbacks, such as high cost, hair loss, and adverse reactions. Therefore, researchers were urged to use alternative methods with fewer side effects. Thus, the motivation of our research is to develop a highly efficient ACP feature extraction framework.

The motivation behind proposing the EDPC framework stems from the limitations of existing methods like SAAC and PseAAC, which do not fully capture the complexity of peptide sequences. EDPC addresses these gaps by incorporating extended dipeptide patterns and local sequence environments, leading to a richer and more comprehensive feature set. The expected benefits of EDPC include improved accuracy, reduced noise and redundancy, and a more holistic representation of peptide sequences. This enhancement in feature extraction is anticipated to significantly impact anticancer peptide identification, facilitating more reliable and efficient discovery of potent anticancer peptides and advancing the field of peptide-based cancer therapies. The ultimate objective of this research is as follows.

In this research effort, we propose an innovative feature extraction approach, EDPC, to enhance the identification process of anti-cancer peptides.
We utilize ensemble learning based on different feature extraction models. We used the cluster database at high identity with tolerance (CD-HIT) framework to address the problem of noise and redundant features. The assessment is conducted on the EDPC features extraction in relation to two widely utilized feature extraction methods: SAAC and PseAAC.
We have compared four distinct ML algorithms: Support Vector Machine (SVM), Decision Tree (DT), Random Forest (RF), and K-Nearest Neighbor (KNN) with the proposed model to check the accuracy of the proposed model.
The performance of EDPC is evaluated using state-of-the-art techniques and ML algorithms. After a thorough investigation, we found that the proposed EDPC exceeds the accuracy of current state-of-the-art models. SVM outperforms the other classifiers with the proposed method, reaching accuracies of 96.6% and 90.3% on the small and large datasets, respectively, even though RF and DT perform better in some circumstances.
This research aims to identify efficient methods for cancer treatment. Our ultimate objective is to expedite the discovery of powerful ACPs, which would open a new direction for cancer treatments.

The rest of the paper is organized as follows: "Literature review" provides an overview of the recent state of the art. "Materials and methods" discusses the proposed model. "Results and discussion" presents the results and discussion. "Conclusion and future work" concludes the study.

Literature review

ACPs are a promising class of cancer therapeutics. The rational design optimizes peptide properties for enhanced anticancer activity by modifying AAC and computational modeling. This approach requires expertise and extensive validation.

Rational design of ACPs

Rational design is an effective strategy for the development of ACPs. It allows the customization of peptide properties to enhance anticancer activity. This approach involves designing peptide sequences based on their physicochemical properties, structural features, and interactions with target molecules. One standard rational design method is to modify the AAC of existing peptides to improve their activity. For instance, KR-12, a modified peptide of LL-37, was designed by introducing cationic and hydrophobic amino acids into the sequence to enhance anticancer activity and inhibit tumor growth in vivo by inducing apoptosis in cancer cells¹⁴. Another strategy involves developing peptides that target specific molecules that play a role in cancer development and progression. One example is the peptide iRGD, which is designed to target $\alpha v$ integrins that are frequently overexpressed on the surface of cancer cells. Research demonstrated that iRGD could enhance the delivery of drugs and nanoparticles to tumors and, hence, improve therapeutic efficacy¹⁶. Computational modeling, such as simulations and docking studies, can be used to study the interactions between peptides and target molecules, as well as to predict the binding affinity of peptides for specific receptors¹⁶. ACPs have emerged as a promising new class of cancer therapeutics owing to their high specificity and potency against cancer cells. The design of ACPs is a valuable approach that enables the customization of peptide properties to enhance their anticancer activity¹⁷. Rational design involves the design of peptide sequences based on their physicochemical properties, structural features, and interactions with target molecules¹⁸.

Another standard rational design method is to modify the AAC of existing peptides to improve their activity. For example, KR-12, a modified peptide of LL-37, was designed by introducing cationic and hydrophobic amino acids into the sequence. The objective is to enhance its anticancer activity and inhibit tumor growth in vivo by inducing apoptosis in cancer cells¹⁹. In addition, computational modeling is also utilized in rational design to predict the structure and properties of designed peptides. Molecular dynamics simulations and docking studies can be used to study the interactions between peptides and target molecules, as well as predict the binding affinity of peptides for specific receptors²⁰. Computational modeling was used to design the peptide BP100-D4²¹. Rational design is a promising strategy for developing ACPs that allows optimizing peptide properties to improve their anticancer activity and selectivity. However, it requires expertise in protein chemistry and computational modeling. In addition, extensive experimental validation is required to confirm the activity and safety of designed peptides (Supplementary information).

Sequence-based features for ACP prediction

AAC refers to the frequencies of individual amino acids in a peptide sequence. It is the simplest sequence-based feature. This feature has been employed in numerous studies, including the research conducted by Ahmad et al.²², who used AAC to develop an SVM model for ACP prediction. Similarly, Sequeira et al.²³ used AAC and other sequence-based features with ML algorithms to build a prediction model for ACPs. Dipeptide composition is another commonly used sequence-based feature extraction framework; it considers the frequency of pairs of amino acids in a peptide sequence. Meher et al.²⁴ developed a prediction model for ACPs using dipeptide composition and SVM. In another study, Lv et al.²⁵ used dipeptide composition and other features to build a prediction model for ACPs using gradient boosting. Wang et al.²⁶ developed a prediction model for ACPs using physicochemical properties and SVM. Sequence-based features are widely used for ACP prediction, and different features can provide complementary information about the properties of peptides relevant to their anticancer activity. ML algorithms such as SVM and RF have been used to develop prediction models for ACPs based on sequence-based features.

AAC is the simplest and most commonly used sequence-based feature. It refers to the frequencies of individual amino acids in a peptide sequence. Several studies have used AAC to predict ACPs. For example, Basith et al.²⁷ developed an SVM model for ACP prediction based on AAC. Similarly, Wei et al.²⁸ used AAC and other sequence-based features to develop a prediction model for ACPs using ML algorithms. Dipeptide composition is another commonly used sequence-based feature that considers the frequency of pairs of amino acids in a peptide sequence. Manavalan et al.²⁹ developed an SVM-based prediction model for ACPs using dipeptide composition. Boopath et al.³⁰ developed an SVM-based prediction model for ACPs using physicochemical properties. While sequence-based features are widely used for ACP prediction, they have several limitations. First, they do not consider the three-dimensional structure of peptides, which is important for their activity. Second, they do not provide information about the interaction between peptides and target molecules, which is critical for their selectivity. Third, they cannot capture the complex relationship between peptide structure and activity. Sequence-based features have nevertheless provided valuable insights into the physicochemical and structural properties of peptides relevant to their activity. Developing more accurate prediction models will require integrating sequence-based features with other types of features, such as structural and interaction-based features.

Machine learning

SVM is one of the most commonly used ML algorithms for prediction^31,32. SVMs are supervised learning algorithms with widespread applications in bioinformatics and many other fields. Their function is to identify the optimal separating hyperplane. Ahmad et al.¹¹ developed an SVM-based prediction model for ACPs using feature selection. The study used a dataset comprising 298 ACPs and 298 non-ACPs to train and test the SVM model. The authors used AAC and dipeptide composition as features and obtained an accuracy of 94.9%, indicating that the model is effective in predicting ACPs. Another family of ML algorithms used for ACP prediction is based on deep learning. Gulam et al.³³ developed a prediction model for ACPs using a convolutional neural network (CNN). The authors used a dataset of 902 ACPs and 902 non-ACPs to train and test the CNN model, with the amino acid sequence as input, to predict whether a peptide is an ACP. Computational prediction of ACPs offers a rapid and efficient way to identify potential therapeutic agents for cancer treatment. Therefore, in this study, the authors reviewed some of the recent studies that have employed ML algorithms for ACP prediction. Zhang et al.³⁴ presented an SVM-based prediction model using the XGB-RFE technique. The results showed that the SVM model achieved a high accuracy of 91.5% and an AUC of 0.87, which indicates that the SVM model is effective in predicting ACPs. Similarly, Wu et al.³⁵ presented a novel tool called ACP-MCAM. Zhao et al.³⁶ proposed a prediction model for ACPs using RF and an auto-correlation approach. The authors used a dataset of 1,265 ACPs and 1,265 non-ACPs to train and test the RF model, with a hybrid feature selection approach combining dipeptide composition and auto-correlation features. The results showed that the RF model achieved an accuracy of 87.9% and an AUC of 0.84, indicating that the RF model is effective in predicting ACPs.
ML algorithms, such as RF and SVM, have shown promising results in predicting ACPs.

In recent years, many AI algorithms and their applications, such as deep learning, have been developed^37,38. Table 1 summarizes the existing state-of-the-art techniques for ACPs. Rational design of ACPs involves designing peptides based on specific physicochemical properties, while sequence-based features for ACP prediction involve analyzing the sequence features of known ACPs to predict novel ACPs. ML algorithms for ACP prediction, in turn, involve developing prediction models that analyze the sequence and/or physicochemical features.

Table 1 Summary of the existing techniques for ACPs.

Materials and methods

In this research, the proposed feature extraction method consists of five phases, as shown in Fig. 1. In the first phase, the dataset is collected and pre-processed using the CD-HIT and iLearn techniques to remove redundancy and noise. In the second phase, features are extracted using SAAC, PseAAC, and the proposed EDPC. The data is pre-processed in the third and fourth phases to improve the dataset's quality. Finally, in the fifth phase, ML techniques, i.e., DT, SVM, RF, and KNN, are applied to classify the data.

Figure 1 presents the proposed model. The dataset was noisy and, therefore, required comprehensive pre-processing. Redundancy and noise are removed in the first stage using the CD-HIT and iLearn techniques. Once the dataset was refined, the features were extracted using the SAAC, PseAAC, and EDPC methods. Finally, the ML techniques SVM, DT, RF, and KNN were applied to predict ACPs.

Data set

In this study, we used two datasets, ACP240³⁹ and ACP740⁴⁰, which are commonly used in the research community for anticancer peptide prediction^41,42. This section describes them in detail. ACP240 is a two-class dataset that contains anticancer and non-anticancer peptides. This dataset has 1550 instances, 775 of which belong to the negative class and 775 to the positive class³⁹. The ACP740 dataset comprises 740 samples, with 376 positive and 364 negative samples⁴⁰. These datasets were chosen for their comprehensive collection of anticancer and non-anticancer peptide sequences, offering a diverse range of samples for analysis. The ACP240 dataset includes A anticancer peptides and B non-anticancer peptides, while the ACP740 dataset contains C anticancer peptides and D non-anticancer peptides. The raw data are not refined, and comprehensive preprocessing is required. Ensuring that each class is equally represented in the training and testing sets is essential.

The researchers used the iLearn Plus web server to pre-process the dataset. It accepts sequences in FASTA format and is designed to take at most 2000 sequences at a time. It is an ML platform that provides web-based and graphical interfaces, offers a wide range of algorithms, and automates sequence-based feature extraction. CD-HIT (Cluster Database at High Identity with Tolerance) is an open-source project developed by Fu et al.⁴³. The idea is to reduce the size of the dataset without removing any sequence information. It takes sequences in FASTA format as input and outputs a non-redundant set of sequences.
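The cleaning step can be illustrated with a much simpler stand-in. The sketch below is not CD-HIT and not iLearn Plus: it only parses FASTA text, discards sequences containing non-standard residues, and drops exact duplicates, whereas CD-HIT clusters sequences by an identity threshold. The function names and the example peptides are hypothetical.

```python
from typing import List, Tuple

STANDARD_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")


def parse_fasta(text: str) -> List[Tuple[str, str]]:
    """Parse FASTA-formatted text into (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks).upper()))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks).upper()))
    return records


def drop_noise_and_duplicates(records: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Keep the first copy of each sequence made only of the 20 standard residues."""
    seen = set()
    kept = []
    for header, seq in records:
        if not seq or set(seq) - STANDARD_RESIDUES:
            continue  # treated as noise: empty or contains non-standard symbols
        if seq in seen:
            continue  # exact duplicate of an earlier sequence
        seen.add(seq)
        kept.append((header, seq))
    return kept


# Hypothetical input: p2 duplicates p1 (different case), p3 contains 'X'.
example = """>p1
FLPLLAGLAANFLPKIFCKITRK
>p2
flpllaglaanflpkifckitrk
>p3
GLFDIVKKXX
>p4
GLFDIVKKVVGALGSL
"""
clean = drop_noise_and_duplicates(parse_fasta(example))
```

In practice, an identity-based tool such as CD-HIT would also remove near-duplicates that differ by a few residues, which this sketch does not attempt.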

Feature extraction

Extraction of relevant attributes from primary sequences is a critical task in the development of a computational predictor for the identification of ACPs. To overcome this issue, the researchers in this study suggest SAAC, PseAAC, and EDPC.

Split Amino Acid Composition (SAAC)

SAAC is used for feature representation and is also helpful in overcoming prediction problems⁴⁴. In this technique, the peptide sequence is split into distinct segments, and the occurrence frequency of amino acids in each segment is calculated independently. In SAAC, the peptide sequence is divided into three parts: the N-terminus, the C-terminus, and the region between these two termini. It is represented in Eq. (1).

$$P=[f_{1}^{N},\ldots,f_{20}^{N},f_{1}^{int},\ldots,f_{20}^{int},f_{1}^{C},\ldots,f_{20}^{C}]$$

(1)

The equations below are used to generate the SAAC feature vector.

$$f(i)=\frac{NA(A_{i})}{X_{n}},\quad i=0,1,\ldots,19$$

(2)

$$f(i)=\frac{NA(A_{i})}{M-X_{n}-X_{c}},\quad i=20,21,\ldots,39$$

(3)

$$f(i)=\frac{NA(A_{i})}{X_{c}},\quad i=40,41,\ldots,59$$

(4)

where $A_{i}$ denotes an amino acid residue, $NA(A_{i})$ is the number of occurrences of $A_{i}$ in the corresponding split, $M$ is the length of the sequence, $X_{n}$ is the number of residues in the N-terminal split, $X_{c}$ is the number of residues in the C-terminal split, and $f(i)$ is the i^th element of the SAAC feature vector. Each element is one of the 20 amino acid residue frequencies of one segment.
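The SAAC computation can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the terminal split lengths (`n_term=5`, `c_term=5`) are illustrative defaults that the text does not specify, the segment order (N-terminal, middle, C-terminal) follows the index ranges of Eqs. (2) to (4), and the function names are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, fixed order


def composition(segment: str) -> list:
    """Fraction of each of the 20 amino acids within one segment."""
    n = len(segment)
    if n == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [segment.count(aa) / n for aa in AMINO_ACIDS]


def saac(sequence: str, n_term: int = 5, c_term: int = 5) -> list:
    """60-D split amino acid composition: N-terminal, middle, C-terminal."""
    seq = sequence.upper()
    if len(seq) < n_term + c_term:
        raise ValueError("sequence is shorter than the two terminal splits")
    n_segment = seq[:n_term]                      # X_n residues
    c_segment = seq[len(seq) - c_term:]           # X_c residues
    middle = seq[n_term:len(seq) - c_term]        # M - X_n - X_c residues
    return composition(n_segment) + composition(middle) + composition(c_segment)


# Toy sequence chosen so each segment contains a single residue type.
vector = saac("AAAAACCCCCDDDDD")
```

For the toy sequence, the N-terminal block is all alanine, the middle block all cysteine, and the C-terminal block all aspartate, so exactly one entry in each 20-element block equals 1.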

Pseudo Amino Acid Composition (PseAAC)

PseAAC is used to obtain discrete, numerical features from peptide sequences⁴⁵. PseAAC was introduced to overcome the limitations of AAC, such as the lack of correlation factors and sequence order information. Chou introduced it in 2001, and it has been used widely in many fields, including computational biology, drug discovery, biomedicine, protein attribute prediction, antifreeze protein prediction, and mitochondria localization. PseAAC can be represented as:

$$P=[f_{1},\ldots,f_{20},f_{20+1},\ldots,f_{20+\lambda}]^{T}$$

(5)

where $P$ signifies PseAAC, $T$ represents transposition, ${f}_{1},\ldots,{f}_{20}$ represent the fractions of the 20 unique amino acids, and the remaining $\lambda$ components carry the sequence-order correlation information.

Proposed extended dipeptide composition

The proposed EDPC is a novel feature extraction method developed as an extension to the Dipeptide Composition (DPC) technique. EDPC is based on discrete peptide sequences and uses neighboring amino acids to gather features. EDPC begins with the analysis of peptide sequences. Each peptide sequence is broken down into its constituent dipeptides (pairs of amino acids). This step is crucial as it forms the foundation of the feature extraction process. EDPC extends the analysis for each dipeptide by considering surrounding amino acids up to a specified distance. This extension captures the local sequence environment, which is often critical in determining the biological activity of peptides. The algorithm quantifies the presence and frequency of these extended dipeptide patterns within each peptide sequence, transforming qualitative sequence information into a quantitative feature set. It obtains a feature vector of size 400-D by computing the frequency of each pair of adjacent amino acid residues. One of its main advantages is that EDPC incorporates the global information of each peptide, while other feature extraction techniques, like AAC, only compute the frequency with which individual amino acids occur. This global information can be valuable in identifying anticancer peptides. To calculate EDPC, first split the peptide sequence into dipeptides and count the occurrence frequency of each dipeptide in the peptide. Then, compute the probability of observing each dipeptide using the following equations:

$$EDPC(x)=\frac{DP_{r}^{\prime}}{DP_{T}^{\prime}}$$

(6)

$$P_{r}^{\prime}=\frac{P_{r}-\widehat{P}}{\delta}$$

(7)

$$AC(d)=\sum_{i=1}^{N-d}P_{i}P_{i+d}$$

(8)

where,

$$ATS=\frac{AC(d)}{N-d}$$

(9)

$$I(d)=\frac{\frac{1}{N-d}\sum_{i=1}^{N-d}(P_{i}-\widehat{P})(P_{i+d}-\widehat{P})P_{i+d}}{\frac{1}{N}\sum_{i=1}^{N-d}(P_{i}-\widehat{P})^{2}}$$

(10)

where $EDPC(x)$ is the occurrence frequency of dipeptide $x$ in the peptide sequence, $DP_{r}^{\prime}$ is the count of a single dipeptide out of the 400 possible dipeptides, and $DP_{T}^{\prime}$ is the total number of dipeptides in the sequence. Algorithm 1 presents the proposed EDPC.
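The dipeptide-frequency step of Eq. (6) can be sketched as follows. This is a minimal illustration, not the authors' Algorithm 1, which is not reproduced in the text. With `gap=0` it computes the plain 400-D composition of adjacent residue pairs. The `gap` parameter is a hypothetical way to illustrate "surrounding amino acids up to a specified distance" and is an assumption, as are the function and variable names.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]  # 400 pairs
INDEX = {dp: i for i, dp in enumerate(DIPEPTIDES)}


def dipeptide_composition(sequence: str, gap: int = 0) -> list:
    """400-D vector of pair frequencies: count of each pair / total pairs.

    gap=0 pairs residue i with residue i+1 (adjacent dipeptides);
    gap=g pairs residue i with residue i+g+1 (illustrative assumption).
    """
    seq = sequence.upper()
    step = gap + 1
    counts = [0] * len(DIPEPTIDES)
    total = 0
    for i in range(len(seq) - step):
        pair = seq[i] + seq[i + step]
        if pair in INDEX:          # skip pairs with non-standard residues
            counts[INDEX[pair]] += 1
            total += 1
    if total == 0:
        return [0.0] * len(DIPEPTIDES)
    return [c / total for c in counts]


# "ACAC" contains the adjacent pairs AC, CA, AC.
adjacent = dipeptide_composition("ACAC")
# With gap=1 the pairs are (A, A) and (C, C).
gapped = dipeptide_composition("ACAC", gap=1)
```

Concatenating the vectors for several gap values would give one possible "extended" representation, but whether that matches the published EDPC is not something the text confirms.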

Classification techniques

A model is trained in classification using sample data instances where class labels are known in advance. Although many classification algorithms exist, SVM, RF, DT, and KNN are the most commonly used in the literature with significant results.

Support vector machine (SVM)

Support Vector Machine (SVM) is a powerful supervised learning method first developed by Vapnik⁴⁶. In binary class problems, SVM aims to map the data into a high-dimensional feature space to compute the optimum separating hyperplane. This optimum hyperplane has the largest margin from its support vectors, which reduces the error rate on test samples. SVM supports various kernel functions: linear, polynomial, radial basis function (RBF), and sigmoid. This article used the RBF kernel to obtain the best classification hyperplane. The kernel width parameter $\gamma $ and the regularization parameter $C$ are determined using the grid search method. The SVM decision function is defined as:

$$f\left(x\right)=sign(w\cdot x+b)$$

(11)

where $f(x)$ represents the predicted class label for an input sample $x$, $w$ is the weight vector applied to the features of $x$, and $b$ is the bias term. The $sign( )$ function returns the sign of the linear function $w\cdot x+b$.
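The two ingredients named in this subsection can be written out directly. The sketch below implements the linear decision rule of Eq. (11) and the standard RBF kernel, $K(x,z)=\exp(-\gamma \lVert x-z\rVert^{2})$, which the text names but does not write out. The function names and the numeric inputs are hypothetical, and mapping a score of exactly zero to the positive class is an arbitrary convention chosen here.

```python
import math


def rbf_kernel(x, z, gamma):
    """Standard RBF kernel: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


def linear_decision(w, x, b):
    """Eq. (11): the sign of w . x + b, returned as +1 or -1."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1


# Identical points have kernel value 1; it decays as points move apart.
same = rbf_kernel([0.0, 0.0], [0.0, 0.0], gamma=0.5)
apart = rbf_kernel([0.0, 0.0], [1.0, 1.0], gamma=0.5)

# Illustrative weights and bias, not fitted values.
label_a = linear_decision([1.0, -1.0], [2.0, 1.0], b=0.5)   # score  1.5
label_b = linear_decision([1.0, -1.0], [0.0, 3.0], b=0.5)   # score -2.5
```

A larger $\gamma$ makes the kernel decay faster with distance, which is why $\gamma$ and $C$ are usually tuned together in a grid search.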

Decision tree

A DT is one of the most powerful and well-known algorithms. It is a supervised learning procedure that can be used for regression as well as classification problems, although it is mostly used for classification. A DT has a tree-like structure that contains a root node, branches, and leaf nodes. The entropy used as a splitting criterion is defined as:

$$Entropy=\sum_{i=1}^{c}-{p}_{i}\times {log}_{2}({p}_{i})$$

(12)

Random forest (RF)

RF is a classification algorithm that is broadly used in bioinformatics. It is effectively used in ML for various regression and classification problems. An RF consists of multiple decision trees. The criteria used when growing and evaluating the trees can be expressed mathematically as:

$$MSE=\frac{1}{N}\sum_{i=1}^{N}{({f}_{i}-{y}_{i})}^{2}$$

(13)

$N$ denotes the number of data points, ${f}_{i}$ denotes the model's output, and ${y}_{i}$ denotes the actual value of the data at point $i$.

$$Gini=1-{\sum }_{i=1}^{c}({p}_{i}{)}^{2}$$

(14)

${p}_{i}$ is the relative frequency of class $i$, and $c$ is the number of classes.

$$Entropy={\sum }_{i=1}^{c}-{p}_{i}\times {log}_{2}({p}_{i})$$

(15)

Entropy measures the uncertainty of the class distribution at a node; it is computed from the class probabilities ${p}_{i}$ and their base-2 logarithms.
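The Gini and entropy criteria can be checked by hand on small label lists. The sketch below is a minimal illustration with hypothetical function names; it computes the class probabilities from a list of labels and then applies the two impurity formulas.

```python
import math
from collections import Counter


def class_probabilities(labels):
    """Relative frequency p_i of each class present in the label list."""
    total = len(labels)
    return [count / total for count in Counter(labels).values()]


def entropy(labels):
    """Entropy in bits: -sum(p_i * log2(p_i)) over the classes present."""
    return -sum(p * math.log2(p) for p in class_probabilities(labels))


def gini(labels):
    """Gini impurity: 1 - sum(p_i ** 2) over the classes present."""
    return 1.0 - sum(p * p for p in class_probabilities(labels))


balanced = [1, 1, 0, 0]   # two classes, equally frequent
pure = [1, 1, 1]          # a single class

balanced_entropy = entropy(balanced)
balanced_gini = gini(balanced)
pure_entropy = entropy(pure)
pure_gini = gini(pure)
```

A balanced two-class node is maximally impure (entropy of 1 bit, Gini of 0.5), while a pure node scores 0 under both criteria, so a tree prefers splits that move child nodes toward purity.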

K-nearest neighbor (KNN)

KNN is a well-known classification algorithm used in data mining and bioinformatics for prediction purposes. It is an instance-based ML algorithm, which is why it is also known as a lazy learning method. KNN assigns a data sample to the class that is most common among its nearest neighbor samples. Euclidean distance is used to measure the distance between instances in the feature space. It can be computed as:

$${D}_{dis}(x,y)=\sqrt{{\sum }_{i=1}^{n}({x}_{i}-{y}_{i}{)}^{2}}$$

(16)

The distances are then sorted in ascending order, ${d}_{i}\le {d}_{i+1}$, where $i=1, 2, 3, \ldots, N$.
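The KNN procedure (compute distances, sort them, vote among the nearest) can be sketched in plain Python. This is a minimal illustration with hypothetical names and toy two-dimensional points, not the scikit-learn implementation used in the experiments.

```python
import math
from collections import Counter


def euclidean(x, y):
    """Euclidean distance: square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def knn_predict(train_X, train_y, query, k=3):
    """Label the query by majority vote among its k nearest training points."""
    # Sort (distance, label) pairs in ascending order of distance.
    distances = sorted(
        (euclidean(x, query), label) for x, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Toy data: class 0 clusters near the origin, class 1 near (5, 5).
train_X = [[0, 0], [0, 1], [1, 0], [5, 5], [6, 5], [5, 6]]
train_y = [0, 0, 0, 1, 1, 1]

near_origin = knn_predict(train_X, train_y, [0.2, 0.1], k=3)
near_far_cluster = knn_predict(train_X, train_y, [5.5, 5.5], k=3)
```

Because no model is fitted in advance, all the work happens at prediction time, which is what "lazy learning" refers to.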

Performance evaluation

Different metrics are used in ML to assess the quality of a model's predictions. These metrics are based on the true and false predictions recorded in a confusion matrix. Accuracy is the most commonly used measure of prediction quality, but accuracy alone is insufficient to assess the performance of a predictor. Therefore, several additional performance metrics are used. These metrics are Accuracy, Sensitivity, Specificity, Precision, F1-score, and Recall, defined as:

$$Acc=\frac{TN+TP}{TN+TP+FN+FP}$$

(17)

$$Sens=\frac{TP}{TP+FN}$$

(18)

$$Spec=\frac{TN}{TN+FP}$$

(19)

$$Prec=\frac{TP}{TP+FP}$$

(20)

$$F1\text{-}Score=2\times \frac{Precision\times Recall}{Precision+Recall}$$

(21)

$$Recall=\frac{TP}{TP+FN}$$

(22)

In the above equations, TN, TP, FN, and FP represent True Negative, True Positive, False Negative, and False Positive, respectively.
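Eqs. (17) to (22) can be computed directly from the four confusion-matrix counts. The sketch below is a minimal illustration with hypothetical function names and a hypothetical set of eight predictions; returning 0.0 for a zero denominator is a convention chosen here, not something the text specifies.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP, FN for a binary problem."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive and pred == positive:
            tp += 1
        elif truth == positive:
            fn += 1
        elif pred == positive:
            fp += 1
        else:
            tn += 1
    return tp, tn, fp, fn


def safe_div(numerator, denominator):
    """Return 0.0 instead of failing when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def evaluate(y_true, y_pred, positive=1):
    """Accuracy, sensitivity (recall), specificity, precision, and F1-score."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred, positive)
    sensitivity = safe_div(tp, tp + fn)
    precision = safe_div(tp, tp + fp)
    return {
        "accuracy": safe_div(tp + tn, tp + tn + fp + fn),
        "sensitivity": sensitivity,
        "specificity": safe_div(tn, tn + fp),
        "precision": precision,
        "f1": safe_div(2 * precision * sensitivity, precision + sensitivity),
    }


# Hypothetical labels: TP=3, FN=1, TN=3, FP=1.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
report = evaluate(y_true, y_pred)
```

Recall in Eq. (22) and sensitivity in Eq. (18) are the same quantity, so the sketch computes it once.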

Results and discussion

We used Python for experimentation, with Google Colab and Jupyter Notebook for implementing the code. The dataset is divided into two parts: 70% is used to train the predictor, and the remaining 30% is used to test it.

Experimental environment

This sub-section presents the experimental results and analysis. The Scikit-learn library is used to get the default parameters for all classifiers. The Python 3 environment is used. The system had 8 GB of memory and 3.3 GHz processing power.
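The setup described here (70/30 split, scikit-learn classifiers with default parameters, grid search for the RBF SVM) can be sketched as follows. This is a hedged illustration, not the authors' script: the feature matrix is synthetic data from `make_classification` standing in for a 400-D feature set, the stratified split and fixed `random_state` are assumptions, and the grid values for `C` and `gamma` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a 400-D feature matrix; NOT the ACP240/ACP740 data.
X, y = make_classification(
    n_samples=200, n_features=400, n_informative=20, random_state=0
)

# 70% training, 30% testing; stratify keeps the class ratio in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# The four classifiers with scikit-learn defaults (SVC defaults to RBF).
classifiers = {
    "SVM": SVC(),
    "DT": DecisionTreeClassifier(random_state=0),
    "RF": RandomForestClassifier(random_state=0),
    "KNN": KNeighborsClassifier(),
}

scores = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    scores[name] = clf.score(X_test, y_test)  # test-set accuracy

# Illustrative grid search over C and gamma for the RBF SVM.
grid = GridSearchCV(
    SVC(kernel="rbf"),
    {"C": [0.1, 1, 10], "gamma": ["scale", 0.001, 0.01]},
    cv=5,
)
grid.fit(X_train, y_train)
best_params = grid.best_params_
```

The accuracies obtained on this synthetic data say nothing about the reported results; the sketch only shows the shape of the experimental loop.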

Split amino acid composition (SAAC)

With SAAC, four classifiers (SVM, DT, RF, and KNN) were evaluated on two datasets (ACP240 and ACP740) using four performance metrics (Accuracy, Precision, Recall, and F1-score), as shown in Table 2. The RF classifier achieved the highest accuracy scores of 0.90 and 0.83 for the ACP240 and ACP740 datasets, respectively.

Table 2 SAAC feature extraction framework on the ACP240 and ACP740.

The authors have evaluated the performance of four classifiers: SVM, Decision Tree, RF, and KNN. For each classifier, accuracy, precision, recall, and F1-score are used as evaluation metrics.

The RF classifier achieved the highest accuracy of 0.90 on the ACP240 dataset, followed by SVM with 0.86 accuracy. On the ACP740 dataset, the RF classifier achieved the highest accuracy of 0.83, followed by SVM with 0.83 accuracy. Overall, RF and SVM classifiers performed better than Decision Tree and KNN classifiers on both datasets. Figure 2 shows a detailed analysis of the SAAC Feature Extraction Technique.

Pseudo amino acid composition (PseAAC)

Pseudo Amino Acid Composition (PseAAC) is used to evaluate the results using four classifiers, i.e., SVM, DT, RF & KNN, on two datasets, i.e., ACP240 and ACP740. It is based on five performance metrics (Accuracy, Sensitivity, Specificity, Precision, and F1-score), which are shown in Table 3.

Table 3 PseAAC Feature extraction framework on the ACP240 and ACP740.

The SVM classifier performed best on ACP740, while RF performed best on ACP240, with accuracies of 91% and 84%, respectively. DT and KNN had lower accuracies on both datasets.

Overall, the PseAAC feature extraction framework appears to be a promising method for classifying the peptides in the ACP240 and ACP740 datasets, as shown in Fig. 3.

Proposed extended dipeptide composition (EDPC)

With the proposed EDPC, the four classifiers (SVM, DT, RF, and KNN) were evaluated on the two datasets (ACP240 and ACP740) using accuracy, precision, recall, and F1-score.

Table 4 shows higher performance for all classifiers on both datasets. Specifically, the SVM classifier achieved 96.6% and 90.3% accuracy on ACP240 and ACP740 datasets, respectively.

Table 4 Proposed EDPC feature extraction framework on the ACP240 and ACP740.

Figure 4 shows how the performance of the classifiers changes as the dataset size increases. The KNN and SVM classifiers showed improved accuracy on the larger dataset, whereas the RF classifier showed a decrease in accuracy. This indicates that RF may not perform as well when dealing with larger peptide datasets.

The EDPC framework outperforms SAAC and PseAAC due to its comprehensive feature representation, effective noise and redundancy reduction, holistic view of peptide sequences, and robustness across various ML algorithms. Unlike SAAC and PseAAC, EDPC captures extended dipeptide patterns along with the local sequence environment, providing richer and more detailed information. The CD-HIT framework further enhances EDPC by effectively reducing noise and redundant features, resulting in a cleaner and more informative feature set. Additionally, EDPC combines local and global sequence information, offering a holistic view of peptide sequences critical for accurate anticancer peptide identification. This comprehensive approach, coupled with the robustness and adaptability of EDPC across different ML algorithms, ensures superior performance and reliability.
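
To make the idea of dipeptide patterns plus local sequence environment concrete, the sketch below computes the classical 400-dimensional dipeptide composition and concatenates it with gapped variants, in which the two residues of a pair are separated by one or more positions. This is an assumption-laden illustration: the gapped-pair scheme, the `max_gap` parameter, the function names, and the example sequence are hypothetical, and the exact EDPC formulation is the one defined in the paper's methods, not this code.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 pairs


def gapped_dipeptide_composition(seq, gap=0):
    """Frequencies of residue pairs (seq[i], seq[i + gap + 1]).

    gap=0 gives the classical dipeptide composition.
    """
    counts = dict.fromkeys(DIPEPTIDES, 0)
    step = gap + 1
    n_pairs = 0
    for i in range(len(seq) - step):
        pair = seq[i] + seq[i + step]
        if pair in counts:  # skip pairs containing non-standard residues
            counts[pair] += 1
            n_pairs += 1
    return [counts[d] / n_pairs if n_pairs else 0.0 for d in DIPEPTIDES]


def extended_features(seq, max_gap=2):
    """Concatenate compositions for gaps 0..max_gap (illustrative only)."""
    features = []
    for gap in range(max_gap + 1):
        features.extend(gapped_dipeptide_composition(seq, gap))
    return features


# Arbitrary example string, not a sequence from the datasets.
vector = extended_features("ACDKKLLACKKLA")
print(len(vector))  # 1200
```

Each gap contributes a 400-dimensional block, so the vector length grows linearly with `max_gap`; this growth is one reason a redundancy-reduction step becomes useful for such representations.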

Comparison with state-of-the-art techniques

Our proposed EDPC was compared with state-of-the-art techniques, i.e., XGB-RFE³⁴ and ENACP³⁶, on the ACP240 and ACP740 datasets. The independent test was carried out by applying each model to the same datasets with a consistent evaluation methodology. Table 5 shows the comparison based on accuracy, precision, recall, and F1-score.

Table 5 Comparison of proposed EDPC with state-of-the-art techniques.

For the ACP240 dataset, the EDPC framework achieves an accuracy of 0.966, higher than the accuracies achieved by XGB-RFE (0.85) and ENACP (0.87). Regarding precision, recall, and F1 score, the EDPC framework outperforms XGB-RFE and is comparable to ENACP.
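
The size of the accuracy gap on ACP240 can be read off the figures quoted above. The short calculation below only restates those three reported values and subtracts them; the variable names are illustrative.

```python
# Accuracies on ACP240 as quoted in the text above.
acp240_accuracy = {"EDPC": 0.966, "XGB-RFE": 0.85, "ENACP": 0.87}

# Absolute margin of EDPC over each baseline, rounded to three decimals.
margins = {
    name: round(acp240_accuracy["EDPC"] - acc, 3)
    for name, acc in acp240_accuracy.items()
    if name != "EDPC"
}
print(margins)  # {'XGB-RFE': 0.116, 'ENACP': 0.096}
```

In other words, the reported margin is about 11.6 percentage points over XGB-RFE and 9.6 percentage points over ENACP.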

For the ACP740 dataset, the EDPC framework achieves an accuracy of 0.948, which is higher than XGB-RFE (0.89) and ENACP (0.84). The same pattern holds for recall, precision, and F1-score, where the EDPC framework outperforms both XGB-RFE and ENACP.

Overall, the results suggest that the proposed EDPC feature extraction framework is a promising method for predicting ACPs on both datasets, as shown in Fig. 5.

The Proposed EDPC feature extraction framework outperforms AAC XGB-RFE and ENACP on both datasets regarding the accuracy, recall, precision, and F1-score. The reasons for its performance are attributed to factors such as improved feature extraction framework EDPC and more effective dataset handling. The main conclusion of the paper is that the proposed EDPC method, as a novel feature extraction technique, significantly enhances the performance of traditional ML algorithms in identifying anticancer peptides. The study demonstrates that EDPC, when integrated with established algorithms like SVM, DT, RF, and KNN, outperforms existing methods such as SAAC and PseAAC. The key finding is that EDPC provides a more detailed and accurate representation of peptide sequences, improving classification accuracy and effectiveness in identifying potential anticancer peptides. This advancement holds promise for the development of new peptide-based cancer therapies.

Statistical significance test

We performed a statistical test on all measures to check whether these improvements were significant or due to random chance. The test reports that all the results are significant at the 95% confidence level. The only exception is the random forest results, which are significant at the 90% confidence level. The significance test results are presented in Table 6.
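
The text does not name the test that was used, so the following is only one possible way to check whether a paired improvement could be due to chance: an exact sign-flip permutation test on paired scores. The function name, the choice of test, and the per-run accuracies are all hypothetical and serve only to show the mechanics.

```python
from itertools import product


def paired_sign_flip_pvalue(scores_a, scores_b):
    """Exact two-sided sign-flip permutation test on paired differences."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    n_extreme = 0
    n_total = 0
    # Under the null hypothesis each difference is equally likely to be +d or -d.
    for signs in product((1, -1), repeat=len(diffs)):
        stat = abs(sum(s * d for s, d in zip(signs, diffs)))
        if stat >= observed - 1e-12:  # tolerance for floating-point rounding
            n_extreme += 1
        n_total += 1
    return n_extreme / n_total


# Hypothetical per-run accuracies, NOT values from the paper.
method_a = [0.95, 0.97, 0.96, 0.98, 0.96, 0.97, 0.95, 0.96]
method_b = [0.90, 0.91, 0.89, 0.92, 0.90, 0.91, 0.88, 0.90]
p_value = paired_sign_flip_pvalue(method_a, method_b)
print(f"p = {p_value:.4f}")
print("significant at the 95% level:", p_value < 0.05)
```

A result counts as significant at the 95% level when the p-value is below 0.05, and at the 90% level when it is below 0.10.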

Table 6 Statistical analysis of model performance on the ACP240 and ACP740 datasets.

Conclusion and future work

This study proposed a novel feature extraction framework for predicting ACPs named EDPC. We applied the Cluster Database at High Identity with Tolerance (CD-HIT) technique to remove noise and redundant features. We implemented the proposed framework with ML classifiers, achieving improved accuracy compared to the state-of-the-art models. The study highlights the importance of both feature extraction and classifier choice in achieving optimal accuracy in peptide classification tasks.

Our study on the EDPC method involved independent validation using the ACP240 and ACP740 datasets, ensuring no overlap between training and validation data. The models, trained with algorithms such as SVM, DT, RF, and KNN, were tested on unseen data within these datasets. Performance was assessed using metrics such as accuracy, sensitivity, and specificity.

However, the study is limited by its reliance on these specific datasets, which may not encompass the full diversity of anticancer peptides. Additionally, while EDPC's effectiveness with the employed ML models was demonstrated, its performance with other advanced models, particularly deep learning algorithms, still needs to be explored. Another limitation is the EDPC method's focus on sequence-based features, potentially overlooking other biologically relevant data such as three-dimensional structural information. The generalizability of our findings is also constrained by the dataset-specific nature of the study, necessitating further validation across a broader range of datasets. Lastly, the computational resources required for EDPC might exceed those for simpler methods, which could be a consideration in some applications. Despite these limitations, our study provides valuable insights into the use of ML for anticancer peptide prediction, setting a foundation for further research to enhance and expand upon these findings.

Future research can integrate multiple feature extraction techniques to better capture diverse information from peptide sequences. By combining different methods, researchers can create more comprehensive representations of the input data, improving prediction accuracy for ACP prediction. Deep learning approaches such as Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and transformers offer further avenues for enhancing the predictive capabilities of ACP models. Alternative pre-processing techniques can also be studied to improve dataset refinement, and novel approaches can be explored to handle noise, missing data, and class imbalance, resulting in cleaner and more balanced datasets and, in turn, more robust and reliable models. Our future work will include extending the evaluation to a larger variety of datasets, incorporating tertiary structural information, and using deep learning techniques to improve the proposed EDPC.

Data availability

Data is provided within the manuscript or supplementary information files.

References

1. Sung, H. et al. Global cancer statistics 2020: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J. Clin. https://doi.org/10.3322/caac.21660 (2021).
2. Dyba, T. et al. The European cancer burden in 2020: Incidence and mortality estimates for 40 countries and 25 major cancers. Eur. J. Cancer 157(5), 308–347 (2021).
3. Ullah, F., Salam, A., Abrar, M. & Amin, F. Brain tumor segmentation using a patch-based convolutional neural network: A big data analysis approach. Mathematics 11(10), 16–35 (2023).
4. Boman, H. G. Antibacterial peptides: basic facts and emerging concepts. J. Internal Med. 254(3), 197–215 (2003).
5. Lane, N. & Kahanda, I. DeepACPpred: A Novel Hybrid CNN-RNN Architecture for Predicting Anti-Cancer Peptides. In International Conference on Practical Applications of Computational Biology & Bioinformatics (eds Panuccio, G. et al.) (Springer International Publishing, 2021).
6. Haney, E. F., Mansour, S. C. & Hancock, A. P. Antimicrobial peptides: an introduction, Antimicrobial peptides: methods and protocols (Springer, 2017).
7. Li, F.-M. & Wang, X.-Q. Identifying anticancer peptides by using improved hybrid compositions. Sci. Rep. 6, 33910 (2016).
8. Chen, W., Ding, H., Feng, P., Lin, H. & Chou, K. C. iACP: A sequence-based tool for identifying anticancer peptides. Oncotarget 7(13), 16–28 (2016).
9. Tyagi, A. et al. In silico models for designing and discovering novel anticancer peptides. Sci. Rep. 3, 1–8 (2013).
10. Li, F.-M. & Wang, X.-Q. Identifying anticancer peptides by using improved hybrid compositions. Sci. Rep. 6(3), 1–6 (2016).
11. Akbar, S., Hayat, M., Iqbal, M. & Jan, M. A. iACP-GAEnsC: Evolutionary genetic algorithm-based ensemble classification of anticancer peptides by utilizing hybrid feature space. Artif. Intell. Med. 79(2017), 62–70 (2017).
12. Kabir, M. et al. Intelligent computational method for discrimination of anticancer peptides by incorporating sequential and evolutionary profiles information. Chem. Intell. Lab. Syst. 182(79), 158–165 (2018).
13. Vijayakumar, S. & Ptv, L. ACPP: A web server for prediction and design of anti-cancer peptides. Int. J. Peptide Res. Ther. 21(2015), 99–106 (2015).
14. Alsanea, M. et al. To Assist Oncologists: An efficient machine learning-based approach for anti-cancer peptides classification. Sensors 22(11), 1–19 (2022).
15. Lin, H. et al. Predicting cancerlectins by the optimal g-gap dipeptides. Sci. Rep. 5(1), 1–9 (2015).
16. Sugahara, K. N. et al. Tissue-penetrating delivery of compounds and nanoparticles into tumors. Cancer Cell 16(6), 510–520 (2009).
17. Wang, G. Antimicrobial peptides: discovery, design and novel therapeutic strategies (CABI, 2010).
18. Hou, H. et al. A review of bioactive peptides: chemical modification, structural characterization and therapeutic applications. J. Biomed. Nanotechnol. 16(12), 1687–1718 (2020).
19. Tossi, A., Sandri, L. & Giangaspero, A. Amphipathic, α-helical antimicrobial peptides. Peptide Sci. 55(1), 4–30 (2000).
20. Nielsen, M. & Andreatta, M. NetMHCpan-3.0; improved prediction of binding to MHC class I molecules integrating information from multiple receptor and peptide length datasets. Genome Med. 8(1), 1–9 (2016).
21. Rauf, A. et al. Comprehensive review on naringenin and naringin polyphenols as a potent anticancer agent. Environ. Sci. Poll. Res. 29(21), 31025–31041 (2022).
22. Ahmad, A. et al. Identification of antioxidant proteins using a discriminative intelligent model of k-space amino acid pairs-based descriptors incorporating with ensemble feature selection. Biocybernetics Biomed. Eng. 42(10), 727–735 (2022).
23. Sequeira, A. M. F. T. Building an automated platform for the classification of peptides/proteins using machine learning (Springer International Publishing, 2021).
24. Meher, P. K., Sahu, T. K., Saini, V. & Rao, A. R. Predicting antimicrobial peptides with improved accuracy by incorporating the compositional, physico-chemical and structural features into Chou’s general PseAAC. Sci. Rep. 7(5), 1–12 (2017).
25. Lv, Z., Wang, D., Ding, H., Zhong, B. & Xu, L. Escherichia coli DNA N-4-methycytosine site prediction accuracy improved by light gradient boosting machine feature selection technology. IEEE Access 8(10), 14851–14859 (2020).
26. Wang, G., Li, X. & Wang, Z. APD3: the antimicrobial peptide database as a tool for research and education. Nucleic Acids Res. 44(5), D1087–D1093 (2016).
27. Basith, S., Manavalan, B., Shin, T. H., Lee, D. Y. & Lee, G. Evolution of machine learning algorithms in the prediction and design of anticancer peptides. Curr. Protein Peptide Sci. 21(21), 1242–1250 (2020).
28. Wei, L., Zhou, C., Chen, H., Song, J. & Su, R. ACPred-FL: a sequence-based predictor using effective feature representation to improve the prediction of anti-cancer peptides. Bioinformatics 34(23), 4007–4016 (2018).
29. Manavalan, B. et al. MLACP: machine-learning-based prediction of anticancer peptides. Oncotarget 8(21), 77–121 (2017).
30. Boopathi, V. et al. mACPpred: a support vector machine-based meta-predictor for identification of anticancer peptides. Int. J. Mol. Sci. 20(10), 19–64 (2019).
31. Nasrolahzadeh, M., Rahnamayan, S. & Haddadnia, J. Alzheimer’s disease diagnosis using genetic programming based on higher order spectra features. Mach. Learn. Appl. 7, 12–25 (2022).
32. Baltrušaitis, T., Ahuja, C. & Morency, L. P. Multimodal machine learning: A survey and taxonomy. IEEE Trans. Pattern Anal. Mach. Intell. 41(2), 423–443 (2018).
33. Ghulam, A. et al. ACP-2DCNN: deep learning-based model for improving prediction of anticancer peptides using two-dimensional convolutional neural network. Chemometrics Intell. Lab. Syst. 226(6), 1–19 (2022).
34. Zhang, Q. et al. StackPDB: predicting DNA-binding proteins based on XGB-RFE feature optimization and stacked ensemble classifier. Appl. Soft Comput. 99(10), 10–21 (2021).
35. Wu, X., Zeng, W., Lin, F., Xu, P. & Li, X. Anticancer Peptide prediction via multi-kernel cnn and attention model. Front. Genetics https://doi.org/10.3389/fgene.2022.887894 (2022).
36. Ge, R. et al. Enacp: An ensemble learning model for identification of anticancer peptides. Front. Genetics 11, 57–60 (2020).
37. Huang, L. et al. Multi-scale feature fusion convolutional neural network for indoor small target detection. Front. Neurorobotics 16(6), 1–13 (2022).
38. Yun, J. et al. Real-time target detection method based on lightweight convolutional neural network. Front. Bioengineering Biotechnol. 4(10), 1–13 (2022).
39. Flouris, I. et al. Issues in complex event processing: Status and prospects in the Big Data era. J. Syst. Softw. 127, 217–236 (2017).
40. Stefanowski, J., Krawiec, K. & Wrembel, R. Exploring complex and big data. Int. J. Appl. Mathematics Computer Sci. 27(10), 669–679 (2017).
41. Yi, H. C. et al. ACP-DL: A DL long short-term memory model to predict anticancer peptides using high-efficiency feature representation. Mol. Ther. Nucleic Acids 17, 1–9 (2019).
42. Höhn, J. et al. Combining CNN-based histologic whole slide image analysis and patient data to improve skin cancer classification. Eur. J. Cancer 149(3), 94–101 (2021).
43. Fu, L., Niu, B., Zhu, Z., Wu, S. & Li, W. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 28(11), 3150–3152 (2012).
44. Chou, K. C. Pseudo amino acid composition and its applications in bioinformatics, proteomics and system biology. Curr. Proteomics 6(6), 262–274 (2009).
45. Chou, K. C. Prediction of protein cellular attributes using pseudo-amino acid composition. Proteins Struct. Funct. Bioinform. 43(11), 246–255 (2001).
46. Cortes, C. & Vapnik, V. Support vector machine. Mach. Learn. 20(1), 273–297 (1995).

Acknowledgements

This research was supported by the Researchers Supporting Project number (RSP2024R244), King Saud University, Riyadh, Saudi Arabia.

Author information

Authors and Affiliations

Department of Computer Science, Bacha Khan University, Charsadda, 24420, Pakistan
Faizan Ullah
Department of Computer Science, Abdul Wali Khan University, Mardan, 23200, Pakistan
Abdu Salam
Department of Computer Science and Software Engineering, International Islamic University, Islamabad, 44000, Pakistan
Muhammad Nadeem
School of Computer Science and Engineering, Yeungnam University, Gyeongsan, 38541, Korea
Farhan Amin
Department of Computer Science, College of Computer and Information Sciences, King Saud University, 11543, Riyadh, Saudi Arabia
Hussain AlSalman
Faculty of Computer Studies, Arab Open University, Muscat, Oman
Mohammad Abrar
Department of Information Systems, College of Computer and Information Sciences, King Saud University, 11543, Riyadh, Saudi Arabia
Taha Alfakih

Contributions

Faizan Ullah, Farhan Amin, Abdu Salam, and Mohammad Abrar wrote the main manuscript text, and Muhammad Nadeem, Hussain AlSalman, and Taha Alfakih prepared Figs. 1–3. All authors reviewed the manuscript.

Corresponding authors

Correspondence to Farhan Amin or Hussain AlSalman.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Supplementary Information.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Ullah, F., Salam, A., Nadeem, M. et al. Extended dipeptide composition framework for accurate identification of anticancer peptides. Sci Rep 14, 17381 (2024). https://doi.org/10.1038/s41598-024-68475-8

Received: 19 June 2024
Accepted: 24 July 2024
Published: 29 July 2024
DOI: https://doi.org/10.1038/s41598-024-68475-8
Springer Nature Limited
