Abstract
Gene/feature selection is an essential preprocessing step for building models with machine learning techniques. It also plays a critical role in various biological applications, such as the identification of biomarkers. Although many feature/gene selection algorithms and methods have been introduced, they may suffer from problems such as parameter tuning or a low level of performance. To tackle such limitations, this study introduces a universal wrapper approach based on our previously introduced optimization algorithm and the genetic algorithm (GA). In the proposed approach, candidate solutions have variable lengths, and a support vector machine scores them. To show the usefulness of the method, thirteen classification- and regression-based datasets with different properties were chosen from various biological scopes, including drug discovery, cancer diagnostics, and clinical applications. Our findings confirm that the proposed method outperforms most of the other currently used approaches and also frees users from the difficulties of tuning various parameters. As a result, users may optimize their biological applications, such as obtaining a biomarker diagnostic kit with the minimum number of genes and maximum separability power.
Introduction
In computational biology, researchers may be involved in handling large omics datasets with many features (e.g., genomics, proteomics, metabolomics, etc.)1. For instance, the total number of profiled genes is usually more than 20,000 in human samples, which have been exploited for different purposes such as the detection of biomarkers2. Given that the number of features from proteomics and metabolomics data is potentially much larger3, it is almost impossible to extract a biomarker kit of manageable size from such large datasets4. For instance, in the field of genomic data, researchers aim to (i) select genes having higher separability power between different states, such as cancerous and noncancerous samples, and (ii) confine them to a number that can reasonably be handled5. From the machine learning perspective, features or genes can be categorized into three classes as follows:
(i) Negative features6, which can mislead a learner and reduce its performance. Thus, they must not be selected in the application.
(ii) Neutral features7, which do not play any role in the performance of a learner and only increase prediction time. Like the first group, these features should be avoided.
(iii) Positive features8, which play a determinant role in distinguishing between samples and enhance the performance of a learner. Even for such features, feature selection (FS) methods need to be applied, since some features may be redundant with others, and a large set of them may be represented by a small set.
Due to its combinatorial nature, FS is a nondeterministic polynomial-time hard (NP-hard) problem: the number of possible feature subsets grows exponentially with the number of features, so no known algorithm solves it in polynomial time9. To overcome this time complexity, heuristic and metaheuristic algorithms, which find acceptable answers to such problems, have been developed10.
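The combinatorial blow-up above is easy to quantify: a dataset with n features admits 2^n − 1 non-empty feature subsets, so exhaustive search is hopeless even for modest n. A quick illustrative check (not part of the paper's implementation):

```python
def subset_count(n):
    """Number of non-empty feature subsets of an n-feature dataset."""
    return 2 ** n - 1

# A 10-feature toy dataset already has over a thousand subsets;
# ~20,000 profiled genes give a count with thousands of decimal digits.
small = subset_count(10)             # 1023
digits = len(str(subset_count(20000)))
```

This is why the heuristic and metaheuristic algorithms discussed next trade guaranteed optimality for tractable run times.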
In different studies, it has been shown that the metaheuristic algorithms, which do not confine themselves to a specific range of the search space, are generally more suitable than heuristic algorithms11,12,13. In addition, two-step methods may obtain better results than single methods14,15. Therefore, in this study, we capitalized on a two-step method, which is based on a genetic algorithm (GA)16 and our previously developed world competitive contests (WCC) optimization algorithm17, the so-called “GA_WCC method”. In the first step of the GA_WCC method, the GA reduces the total number of features to a minimum upper bound. Next, the WCC selects an optimal subset of features for the desired application. Overall, the GA_WCC method is based on a two-step process for FS, which (i) does not require limiting the number of features to a predefined value, and (ii) outperforms other currently used methods.
Related works
In this section, we discuss the limitations of related works, which can be divided into six classes as follows:
(i) Filter methods: These techniques look for relationships among features and investigate how much information a feature contains. For this purpose, various mathematical formulas have been proposed, including entropy18, mutual information19, the Fisher score20, correlation21, the Laplacian score22, etc. Although these approaches are simple and have low time complexity, their performance is lower than that of the other categories23. To tackle this limitation, wrapper-based methods have been developed and are built upon in this paper.
(ii) Wrapper methods: Unlike the first class, these approaches score the selected features with a learner such as a support vector machine (SVM)24, artificial neural network (ANN)25, decision tree (DT)26, or others27,28,29. Usually, optimization algorithms are applied to select an optimal subset of features30,31. Different studies have shown that these approaches can achieve remarkable outcomes32, but most FS studies do not employ state-of-the-art algorithms. Here, we apply the WCC algorithm to the FS problem.
(iii) Ensemble methods: For FS, ensemble methods create a learner such as a decision tree33 and select the features that the learner chooses for generating a model34,35. Due to their greedy nature, ensemble methods may fall into local optima and fail to reach the optimal result. To deal with this limitation, we introduce the WCC algorithm, which has a low probability of falling into local optima.
(iv) Hybrid methods: A combination of the three mentioned methods is applied to the FS problem36. For example, the total number of features is first reduced by filter methods, and then an optimal subset of features is chosen by wrapper or ensemble methods37,38. In this class of related works, it is essential to combine the algorithms properly. Therefore, we assumed that a wrapper-wrapper approach, which merges two wrapper-based algorithms, might be a suitable option for FS.
(v) Hypothesis-based studies: A concept is hypothesized based on prior knowledge, and its correctness is tested via various experiments on gold-standard datasets39. Although these techniques can help in making a proper decision, they do not overcome the aforementioned limitations.
(vi) Review works: These works survey different methods, such as filter40, wrapper41, ensemble42, and hybrid43 approaches, and discuss their advantages and disadvantages. Further, they study the role of FS in diverse areas and often outline future directions44.
Materials and methods
The datasets
Several datasets with diverse properties were selected from various sources, such as the machine learning repository developed at the University of California Irvine (UCI)45 and published seminal literature. For every dataset, the total number of samples is almost the same across its different classes. Table 1 shows the properties of the datasets and describes them.
The proposed method
Our proposed GA_WCC method (Fig. 1) selects the features using a two-step wrapper approach. To this end, as the first step, the Genetic Algorithm (GA) limits the total number of genes or, generally, features, and then the World Competitive Contests (WCC) selects an optimal subset of them from the reduced set of features. Overall, this study has been established based on the following rationale:
(i) The GA starts with a first population of candidate solutions, each of which consists of several variables (a subset of features). Unlike other optimization algorithms, such as particle swarm optimization (PSO)53, the GA has a minimal probability of falling into local optima because it produces a high number of candidate sets. However, the convergence speed of the GA is usually lower than that of other optimization algorithms (e.g., TLBO54 and FOA55). This limitation may be addressed by combining the GA with other state-of-the-art optimization algorithms, which the present study does by merging the GA with the WCC algorithm.
(ii) The WCC begins with a first population of potential answers and applies all of its operators to all the existing candidate solutions (CSs), so it spends more time than other optimization algorithms. Hence, when applying the WCC algorithm to an optimization problem, the total number of CSs is limited. The algorithm has an acceptable convergence speed, but its main limitation relates to its complex stages, which increase the execution time. Further, for a given CS, the WCC calls the cost function more often than other algorithms due to the nature of its operators. In the last steps of the algorithm, the applied operators make the CSs similar to each other, so the convergence speed of the algorithm is reduced (due to the limited number of CSs).
Optimization algorithms differ from each other in the way they change CSs (i.e., in their operators). In this study, the WCC algorithm is adapted to the FS problem, and its operators are modified to select an optimal subset of features. Given the advantages and disadvantages of the GA and the modified WCC algorithm, it is expected that their limitations will be diminished when they are combined. Inspired by this idea, this study was designed, and an efficient two-step feature selection method based on a wrapper approach was introduced. As shown in Fig. 1, the GA_WCC method includes the following steps:
(i) Applying the genetic algorithm: In the first step of the proposed method, a version of the GA is used for FS56. In many FS studies, CSs are binary vectors whose length is constant and equal to the total number of features. In this study, for both the GA and WCC algorithms, CSs have variable sizes and contain the indices of the selected features. In the optimization scope, the GA is the basis for many other optimization algorithms; however, it generally exhibits a lower level of performance in comparison with them. This notwithstanding, the GA produces diverse CSs, which may help other optimization algorithms obtain better results57. Figure 2 shows the flowchart of the employed GA, which includes the following main steps:
(a) Creating a first population of CSs: potential answers, or CSs, are called 'chromosomes' in the GA, and the values of their genes are assigned randomly. Every CS incorporates some features chosen from the given feature set (the total number of variables in a CS depends on the size of the dataset). In the proposed method, the CSs initially have an identical length, but their lengths may come to differ because of repeated values. For instance, when generating the initial CSs, a CS may contain some repeated features; in such a case, only one of the repeated values is retained and the others are discarded.
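As a concrete illustration of step (a), the following Python sketch builds variable-length CSs of feature indices and collapses repeated values. This is our own illustration under stated assumptions (uniform sampling of 1-based indices); the paper's implementation uses MATLAB and the FeatureSelect tool.

```python
import random

def init_population(pop_size, cs_length, n_features, seed=0):
    """Create an initial population of candidate solutions (CSs).

    Each CS is a list of 1-based feature indices. Repeated indices
    within a CS are collapsed to a single occurrence, so CS lengths
    may differ after initialization, as described in the paper.
    """
    rng = random.Random(seed)
    population = []
    for _ in range(pop_size):
        cs = [rng.randint(1, n_features) for _ in range(cs_length)]
        # keep only the first occurrence of each repeated feature
        population.append(list(dict.fromkeys(cs)))
    return population
```

`dict.fromkeys` preserves insertion order, so the surviving copy of a repeated feature is always the first one encountered.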
(b) Applying GA operators: The GA consists of three main operators, named mutation, crossover, and selection. In the employed mutation operator, a variable of a chromosome is randomly selected, and its value is replaced by another randomly selected value. In the crossover operator, two ranges of the CSs with the same length are randomly chosen, and their contents are exchanged. Finally, the selection operator uses an elitism technique, which forms the new population from the fittest chromosomes of the current population. Figures 3 and 4 depict instances of the mutation and crossover operators and describe how they are applied to generate new CSs.
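The mutation and crossover operators described above can be sketched as follows. This is a hedged illustration: the function names, the random-number handling, and the post-operator de-duplication order are our own assumptions, not code from the paper.

```python
import random

def mutate(cs, n_features, rng):
    """Replace one randomly chosen variable with a random feature index,
    then drop any repeats (as the paper's variable-length CSs require)."""
    child = cs[:]
    pos = rng.randrange(len(child))
    child[pos] = rng.randint(1, n_features)
    return list(dict.fromkeys(child))

def crossover(cs_a, cs_b, rng):
    """Swap two randomly placed, equal-length ranges between two CSs."""
    a, b = cs_a[:], cs_b[:]
    seg = rng.randint(1, min(len(a), len(b)))       # shared range length
    start_a = rng.randrange(len(a) - seg + 1)
    start_b = rng.randrange(len(b) - seg + 1)
    a[start_a:start_a + seg], b[start_b:start_b + seg] = \
        b[start_b:start_b + seg], a[start_a:start_a + seg]
    return list(dict.fromkeys(a)), list(dict.fromkeys(b))
```

Because repeats are removed after each operator, offspring may be shorter than their parents, which is how CS lengths drift during the search.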
(c) Scoring the selected features: The proposed method is a wrapper method in which a learner evaluates the selected features. Because the datasets are approximately class-balanced, we primarily use the accuracy score (Eq. 1). Other criteria are also inspected in the experimental section.
$$Score = Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$$

(1)

where TP, TN, FP, and FN represent true positives, true negatives, false positives, and false negatives, respectively. Because of their performance and reasonable model-building time, support vector machines (SVMs) were used for assessing the CSs. Given the popularity and performance of SVMs, many libraries and packages have been developed for them. In this study, the LibSVM library, one of the most popular libraries offering various options, was employed58.
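Eq. (1) translates directly into code. The sketch below only computes the score from confusion-matrix counts; the SVM training itself (LibSVM in the paper) is deliberately out of scope here.

```python
def accuracy_score_eq1(tp, tn, fp, fn):
    """Eq. (1): accuracy = (TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)

# Example confusion-matrix counts (illustrative numbers only):
score = accuracy_score_eq1(tp=40, tn=45, fp=10, fn=5)   # 0.85
```

In the wrapper loop, this score is computed on the cross-validated predictions of the SVM trained on the features listed in each CS.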
(d) Investigating the termination condition: when the value of the best CS remains constant for 10 consecutive iterations (generations), the GA is terminated, and all of its CSs are passed to the WCC algorithm.
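The stall-based termination test can be expressed as a small helper; this exact routine is our illustration, not published code from the paper.

```python
def ga_should_stop(best_history, patience=10):
    """Stop the GA once the best-CS score has stayed constant for
    `patience` consecutive generations (10 in the paper)."""
    if len(best_history) < patience:
        return False
    last = best_history[-patience:]
    return all(v == last[0] for v in last)
```

`best_history` is simply the best accuracy recorded at the end of each generation.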
(ii) Applying the proposed algorithm (the WCC): As mentioned before, at the end of the first step, the GA passes the created CSs to the proposed algorithm (the flowchart of the WCC algorithm is shown in Fig. 5), and these constitute its first population of CSs. Next, the WCC changes the CSs using its operators, which are explained and formulated as follows:
(a) Attacking operator: For a given CS, this operator selects some variables randomly and assigns them new values chosen by chance from [1, n], where n is the total number of existing features/genes. Equation 2 formulates the attacking operator:
$$\mathop \sum \limits_{i = 1}^{k} \left[ {CS\left( r \right) = rand\left( n \right)} \right]$$

(2)

where CS, n, and k are a given candidate solution, the total number of features, and a random integer between 1 and n, respectively. In other words, the parameter k determines how many variables of a CS must be changed. Further, the sigma sign denotes a loop, and r, like k, is a random integer between 1 and n. Figure 6 shows an example of the attacking operator.
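A possible Python reading of Eq. (2) is shown below. Note one assumption on our part: we bound both k and the changed positions by the CS length so every index stays valid, whereas the formula states the bounds in terms of n.

```python
import random

def attack(cs, n_features, rng):
    """Attacking operator (Eq. 2): pick k positions at random and
    assign each a fresh random feature index from [1, n_features]."""
    child = cs[:]
    k = rng.randint(1, len(child))        # how many variables to change
    for _ in range(k):                    # the sigma sign of Eq. 2
        r = rng.randrange(len(child))
        child[r] = rng.randint(1, n_features)
    return list(dict.fromkeys(child))     # collapse repeated features
```

Because many positions can change at once, this operator provides the large, exploratory jumps of the WCC search.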
(b) Transferring operator: Based on the scores (classification accuracy using a given CS), this operator selects several CSs with the highest scores (Selected_CS) and then randomly chooses some values (features) from them. Next, for a given CS, this operator imports the selected values. Equation 3 formulates the mentioned steps, and Fig. 7 describes the transferring operator in detail.
$$\mathop \sum \limits_{j = 1}^{R} \mathop \sum \limits_{i = 1}^{k} \left[ {CS\left( r \right) = Selected\_CS_{m} \left( {rand\left( l \right)} \right)} \right]$$

(3)

where \(l\), R, and m are the length of the Selected_CS, a random integer between 1 and the total number of selected CSs, and the index of the randomly chosen Selected_CS, respectively. The other parameters are as described for Eq. 2.
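Eq. (3) can be sketched similarly. Here `selected_css` stands for the highest-scoring CSs, and the loop bounds R and k are drawn at random as in the formula; the exact sampling scheme is our assumption.

```python
import random

def transfer(cs, selected_css, rng):
    """Transferring operator (Eq. 3): copy randomly chosen feature
    values from the best-scoring CSs (selected_css) into `cs`."""
    child = cs[:]
    k = rng.randint(1, len(child))            # inner-loop bound of Eq. 3
    R = rng.randint(1, len(selected_css))     # outer-loop bound of Eq. 3
    for _ in range(R):
        # index m: which high-scoring donor CS to copy from
        donor = selected_css[rng.randrange(len(selected_css))]
        for _ in range(k):
            r = rng.randrange(len(child))
            child[r] = donor[rng.randrange(len(donor))]  # rand(l)
    return list(dict.fromkeys(child))
```

This is the exploitation step: promising feature indices spread from the best CSs into the rest of the population.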
(c) Passing operator: While the transferring and attacking operators may result in large changes in a CS, this operator guarantees small modifications. For this purpose, the operator selects only one variable by chance and changes its value. Equation 4, whose parameters are explained under Eq. 2, formulates the passing operator.
$$CS\left( r \right) = rand\left( n \right)$$

(4)

Figure 8 illustrates an example of the passing operator and explains how it can be applied to the FS problem.
Each change induced by the operators is accepted only if it increases the accuracy score. Further, repeated features may appear after applying the operators; in these situations, only one of the repeated features is kept and the others are removed. Hence, the length of CSs may vary.
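The passing operator (Eq. 4) and the accept-only-if-better rule can be combined in a few lines; again, this is an illustrative sketch rather than the authors' code, and `score_fn` stands for the SVM-based accuracy evaluation.

```python
import random

def pass_op(cs, n_features, rng):
    """Passing operator (Eq. 4): change exactly one variable."""
    child = cs[:]
    child[rng.randrange(len(child))] = rng.randint(1, n_features)
    return list(dict.fromkeys(child))

def apply_if_better(cs, operator, score_fn):
    """A change is kept only if it raises the accuracy score;
    otherwise the original CS survives unchanged."""
    candidate = operator(cs)
    return candidate if score_fn(candidate) > score_fn(cs) else cs
```

The same `apply_if_better` guard would wrap the attacking and transferring operators, so the population's best score can only improve monotonically.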
(d) Investigating the termination conditions: Several options (e.g., a predefined number of iterations, elapsed time, accuracy, etc.) can be used to terminate the algorithms. In the present study, two different strategies were chosen. As mentioned before, the GA finishes when the accuracy value remains approximately constant over the last ten iterations. For the WCC algorithm, a predetermined number of iterations is used as the termination condition.
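Putting the two termination strategies together, the overall GA_WCC control flow can be sketched as below; `ga_step` and `wcc_step` stand for one generation of the respective algorithm and are placeholders, as is `score_fn`.

```python
def ga_wcc(score_fn, ga_step, wcc_step, pop,
           ga_patience=10, wcc_iters=100):
    """Two-step GA_WCC driver sketch: run the GA until its best score
    stalls for `ga_patience` generations, then hand the whole GA
    population to the WCC for a fixed number of iterations."""
    history = []
    while True:
        pop = ga_step(pop)                    # mutation/crossover/selection
        history.append(max(score_fn(cs) for cs in pop))
        if (len(history) >= ga_patience and
                all(v == history[-1] for v in history[-ga_patience:])):
            break                             # GA stalled: stop step one
    for _ in range(wcc_iters):                # fixed-iteration termination
        pop = wcc_step(pop)                   # attack/transfer/pass
    return max(pop, key=score_fn)             # best feature subset found
```

With stub steps that leave the population unchanged, the driver degenerates to returning the best initial CS, which makes the control flow easy to unit-test.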
Results
To obtain the results, a computer system with a dual-core 2.2 GHz processor and 12 GB of RAM was employed. Further, our previously designed FeatureSelect software and the MATLAB programming language were used for the implementations. In this section, all reported outcomes refer to results from the five-fold cross-validation technique. For comparing the algorithms and methods, the same conditions were considered; for example, the GA, WCC algorithm, and GA_WCC method were allowed to run for an identical time. The population sizes for the GA, WCC algorithm, and GA_WCC method were determined by trial and error based on their run times, choosing the values at which the best performance of the algorithms was observed. Based on the outcomes, the population sizes were set to 100, 20, and 100 for the GA, WCC algorithm, and GA_WCC method, respectively. The mutation and crossover rates were set to 30%, because the GA shows suitable behavior with these values. In addition to the population-size parameters, the WCC algorithm has a match-time parameter (the total number of attempts to change a CS), which was set to 2; for the GA_WCC method, this parameter was set to 1. The outcomes, which encompass the results of five popular filter FS methods, the GA, the WCC, a two-step filter-wrapper method (EN_WCC), and the proposed wrapper-wrapper method (GA_WCC), were divided into the following three categories:
(i) The first category of results: This class consists of the results obtained from applying the mentioned algorithms and methods to the classification-type datasets having more than 50 features. Tables 2 and 3 represent the attained outcomes. Also, Fig. 9 depicts the results of the SVM without applying the FS algorithms to the investigated datasets.
Wrapper-based FS methods improve the performance of the SVM, whereas filter-based FS approaches may reduce it. Overall, among the filter methods, the entropy-based (EN) FS method led to more appropriate results than the others. Moreover, between the GA and WCC algorithms, the WCC yields better outcomes. Hence, a combination of EN and WCC (the so-called EN_WCC) was also investigated and compared against the others. For the Cancer dataset, GA_WCC, GA, and WCC yielded the best solutions; however, GA_WCC and GA classify the data with six features, whereas WCC requires ten attributes. For the Arrhythmia dataset, the proposed approach outperforms the others in terms of the total number of features (NOF) and the other classification criteria. For the Diabetes dataset, EN_WCC selected the minimum number of features and yielded better outcomes than the filter methods, as also observed for the Cancer dataset; nevertheless, the results of GA_WCC, WCC, and GA surpass those of EN_WCC. Similar outcomes are observed for the other datasets. Tables 2 and 3 show that the wrapper and two-step methods are more efficient than the filter ones, and their performance can be ranked as GA_WCC, WCC, GA, and EN_WCC, respectively.
For further evaluation of the methods, receiver operating characteristic (ROC) curves are shown in Figs. 10 and 11. The area under the curve (AUC) values of the approaches on the datasets of the first class of outcomes are listed in Table 4. The two-step and wrapper approaches have remarkable functionality compared to the others, and the proposed method outperforms all of them (Figs. 10, 11, Tables 2, 3, and 4). In another evaluation of the algorithms' performance, the p-value (PV) measure was considered (Table 5). To this end, every algorithm was run in 50 individual executions, and the results of the proposed method (GA_WCC) were considered the test base; the outcomes of the other algorithms were then compared with them. Except for the Cancer dataset, on which the effectiveness of the algorithms is the same, the proposed method outperformed the others on the remaining datasets. Figure 12 also presents boxplots of the algorithms' outputs, evaluated using a one-way ANOVA test. Every execution consists of 100 iterations of the algorithm's steps. At the end of each iteration, the best acquired accuracy was stored, and the convergence behavior of the algorithms was investigated for the datasets including more than 1000 features (Fig. 13). It was observed that the convergence speed of the proposed method is higher than that of the GA and WCC algorithms run individually. As mentioned before, the combined method can efficiently address the limitations of the GA and the WCC algorithm (the low convergence speed of the GA and the restricted number of CSs in the WCC) and yields better outcomes than either algorithm run alone.
In filter FS methods, determining the total number of features is a challenging problem and plays an essential role in the performance of a model. The results of the five filter approaches are shown in Figs. 14, 15, 16, and 17. These outcomes show the performance of the filter FS methods with different numbers of features.
(ii) The second category of results: This section includes the results of the algorithms on the datasets having fewer than 50 features/attributes. The main goal of this section is to check the effect of FS methods on datasets that consist of fewer features. For small datasets, single wrapper methods do not face special challenges in FS; indeed, they may obtain the best solution while improving the run time. Hence, in this section, the functionality of the GA and WCC algorithms is inspected. As in the first part, criteria such as sensitivity, specificity, accuracy, precision, and AUC were investigated. The acquired data are listed in Table 6.
Without applying the GA and WCC algorithms, the SVM alone yields accuracy values of 0.5263, 0.6645, and 0.5812 using the five-fold cross-validation technique on the CHD, SHD, and PID datasets, respectively. By applying the algorithms, the accuracy improved for the CHD and SHD datasets and remained unchanged for the PID dataset. Further, the total number of features was remarkably reduced; thus, the models obtained by applying the algorithms operate faster than the model that uses all the existing features. Comparing the GA and WCC algorithms, the WCC led to a model with a lower number of features and higher criterion values. Therefore, it might be concluded that a state-of-the-art optimization algorithm can obtain more acceptable results than others.
(iii) The third category of results: In this section, the outcomes of the methods and algorithms are evaluated on the regression-based datasets (WDBC and drug datasets). To this end, criteria such as the root mean squared error (RMSE) and the correlation between predicted and real labels were calculated (Table 7). For the filter FS methods, different numbers of features were tested, and their best results were reported. For the wrapper FS approaches, it is not necessary to limit the total number of features, as they can regulate it themselves. Even so, they produce variable results across executions, so they must be run repeatedly; here, their best outcomes among 50 individual executions were reported as solutions to the problem, along with several statistics computed over those executions, including the confidence interval (CI), p-value, and standard deviation (STD).
From the run-time perspective, filter FS methods require less time than wrapper approaches but do not deliver improved outcomes. For instance, for the WDBC dataset, the entropy FS approach yields the minimum error and the maximum correlation between predicted and real labels when the total number of features is limited to 13. The value of correlation can be calculated not only for the entropy method but also for the others. Like the first class of results, this class also shows the remarkable performance of the proposed approach (GA_WCC) in terms of error, correlation, the total number of selected features, run time, etc. Besides, the WCC and GA results show that wrapper FS methods may achieve better results than filter FS approaches. In Fig. 18, the scatter plots of the proposed method on the regression-based datasets are shown.
Discussion
Many methods and algorithms have been proposed for selecting an optimal subset of features, which is indeed an NP-hard problem, particularly in machine learning with a biological context. Besides enhancing the separability power of a model, optimal features improve the speed of a model and may lead to valuable results such as acquiring an optimal kit of biomarkers to be used in applications. In this area, it has been shown that two-step FS approaches lead to better outcomes than single methods59, and wrapper-based FS methods usually outperform filter and embedded FS techniques60. The results of this study also confirm the mentioned observations and allow for the following important key conclusions:
First, wrapper FS methods may obtain an optimal subset of features without requiring the total number of features to be confined to a predefined value. Nevertheless, there are some restrictions in determining the total number of selected features. For example, wrapper methods may obtain a subset of attributes with the highest score while the total number of selected features is greater than the number the problem allows (problem limitations). Even so, we believe that wrapper FS methods are still better than the filter and embedded FS approaches, in large part because they can be formulated in a way that resolves the problem constraints.
Second, limiting filter methods to a predefined number of features is a challenging problem and affects their performance. The results of this work show that the performance of filter FS approaches varies with the number of selected features; thus, this parameter remains a challenge for researchers. However, wrapper methods, which consider a set of features instead of examining each of them separately, do not face this restriction.
Third, FS is also essential for datasets having a low number of features. In the second part of the results, the performance of wrapper FS methods was investigated on some gold-standard datasets whose total number of features is less than 50. Based on other conducted studies61, it seems that FS has been ignored in these works even though it may improve performance. For this class of datasets, considering the total number of features, single wrapper methods might be a proper choice.
Fourth, wrapper-wrapper FS methods may be the best option for selecting an optimal subset of features. In the last decade, different types of hybrid methods have been introduced for the FS problem owing to their promising results. However, most of them combine filter-filter or filter-wrapper approaches, and a suitable configuration of wrapper-wrapper methods has been overlooked. In the present investigation, a wrapper-wrapper approach based on the GA and the proposed WCC algorithm was introduced, which resulted in superior outcomes compared to the other approaches. The WCC algorithm starts with a first population of CSs and then applies its operators to them in order to obtain a better solution to the FS problem. The main difference between the WCC algorithm and other optimization algorithms lies in the steps of the algorithm and its operators. The two-step approaches differ from hybrid methods that merge optimization algorithms, such as the whale optimization algorithm and simulated annealing62. In this study, to obtain an efficient combination of the algorithms, the advantages and limitations of the GA and WCC algorithm were considered. Since the GA produces various CSs, the WCC algorithm confines them to a limited number. Unlike the WCC algorithm, the GA may suffer from a low convergence speed and not show suitable performance relative to other optimization algorithms. For these reasons, the GA and WCC algorithm were combined, and the results showed that their combination yields better outcomes.
Fifth, the performance of algorithms and methods varies across datasets. Every algorithm or method has its own attitude toward the FS problem, so their functionality may differ on various data. Generally, it is impossible to predict a priori which of the methods or algorithms is suitable for a given problem. Nonetheless, wrapper-wrapper FS approaches appear promising for producing the desired results. As future work, the proposed method can be applied to other algorithms, such as the Salp Swarm Algorithm63 and DE64, while considering their limitations and disadvantages. Also, the proposed method scores a set of features but does not rank the features of the obtained set. To address this limitation, the proposed approach can be combined with state-of-the-art ranking techniques such as SVM-RFE65,66.
Conclusion
For selecting an optimal subset of features, a two-step wrapper-wrapper FS method based on the GA and our proposed algorithm (WCC) was introduced and applied to thirteen biological datasets with different properties. In comparison with other approaches, it can be concluded that two-step techniques may lead to better results than single-step methods. Furthermore, among the two-step approaches, wrapper-wrapper FS methods may be more appropriate than others. For biological applications, wrapper approaches seem to be the most convenient and reliable methods, in large part because they do not need to be restricted to a predefined number of features. Taken together, based on our findings, wrapper-wrapper FS methods can be used to optimize FS problems and yield robust and desired outcomes.
References
Ghosh, M., Begum, S., Sarkar, R., Chakraborty, D. & Maulik, U. Recursive memetic algorithm for gene selection in microarray data. Expert Syst. Appl. 116, 172–185 (2019).
Barnabas, G. D. et al. Microvesicle proteomic profiling of uterine liquid biopsy for ovarian cancer early detection. Mol. Cell. Proteomics 18, 865–875 (2019).
Walther, D., Strassburg, K., Durek, P. & Kopka, J. Metabolic pathway relationships revealed by an integrative analysis of the transcriptional and metabolic temperature stress-response dynamics in yeast. Omics J. Integr. Biol. 14, 261–274 (2010).
Frankell, A. M. et al. The landscape of selection in 551 esophageal adenocarcinomas defines genomic biomarkers for the clinic. Nat. Genet. 51, 506–516 (2019).
Long, N. P. et al. Efficacy of integrating a novel 16-gene biomarker panel and intelligence classifiers for differential diagnosis of rheumatoid arthritis and osteoarthritis. J. Clin. Med. 8, 50 (2019).
MotieGhader, H., Masoudi-Sobhanzadeh, Y., Ashtiani, S. H. & Masoudi-Nejad, A. mRNA and microRNA selection for breast cancer molecular subtype stratification using meta-heuristic based algorithms. Genomics 112, 3207–3217 (2020).
Adeli, E., Li, X., Kwon, D., Zhang, Y. & Pohl, K. M. Logistic regression confined by cardinality-constrained sample and feature selection. IEEE Trans. Pattern Anal. Mach. Intell. 42, 1713–1728 (2019).
Salama, M. A. & Hassan, G. A Novel Feature Selection Measure Partnership-Gain. Int. J. Online Biomed. Eng. 15 (2019).
Li, F. et al. Positive-unlabelled learning of glycosylation sites in the human proteome. BMC Bioinf. 20, 1–17 (2019).
Abdel-Basset, M., El-Shahat, D., El-henawy, I., de Albuquerque, V. H. C. & Mirjalili, S. A new fusion of grey wolf optimizer algorithm with a two-phase mutation for feature selection. Expert Syst. Appl. 139, 112824 (2020).
Sayed, G. I., Hassanien, A. E. & Azar, A. T. Feature selection via a novel chaotic crow search algorithm. Neural Comput. Appl. 31, 171–188 (2019).
Masoudi-Sobhanzadeh, Y., Motieghader, H. & Masoudi-Nejad, A. FeatureSelect: a software for feature selection based on machine learning approaches. BMC Bioinf. 20, 170 (2019).
Masoudi-Sobhanzadeh, Y., Omidi, Y., Amanlou, M. & Masoudi-Nejad, A. Trader as a new optimization algorithm predicts drug-target interactions efficiently. Sci. Rep. 9, 1–14 (2019).
Masoudi-Sobhanzadeh, Y., Omidi, Y., Amanlou, M. & Masoudi-Nejad, A. DrugR+: A comprehensive relational database for drug repurposing, combination therapy, and replacement therapy. Comput. Biol. Med. 109, 254–262 (2019).
Rao, H. et al. Feature selection based on artificial bee colony and gradient boosting decision tree. Appl. Soft Comput. 74, 634–642 (2019).
Gronsbell, J., Minnier, J., Yu, S., Liao, K. & Cai, T. Automated feature selection of predictors in electronic medical records data. Biometrics 75, 268–277 (2019).
Masoudi-Sobhanzadeh, Y. & Motieghader, H. World Competitive Contests (WCC) algorithm: A novel intelligent optimization algorithm for biological and non-biological problems. Inf. Med. Unlocked 3, 15–28 (2016).
Mafarja, M. M. & Mirjalili, S. Hybrid binary ant lion optimizer with rough set and approximate entropy reducts for feature selection. Soft. Comput. 23, 6249–6265 (2019).
Rahmaninia, M. & Moradi, P. OSFSMI: online stream feature selection method based on mutual information. Appl. Soft Comput. 68, 733–746 (2018).
Saqlain, S. M. et al. Fisher score and Matthews correlation coefficient-based feature subset selection for heart disease diagnosis using support vector machines. Knowl. Inf. Syst. 58, 139–167 (2019).
Koprinska, I., Rana, M. & Agelidis, V. G. Correlation and instance based feature selection for electricity load forecasting. Knowl.-Based Syst. 82, 29–40 (2015).
Si, L., Wang, Z., Tan, C. & Liu, X. A feature extraction method based on composite multi-scale permutation entropy and Laplacian score for shearer cutting state recognition. Measurement 145, 84–93 (2019).
Pournoor, E., Elmi, N., Masoudi-Sobhanzadeh, Y. & Masoudi-Nejad, A. Disease global behavior: a systematic study of the human interactome network reveals conserved topological features among categories of diseases. Inf. Med. Unlocked 17, 100249 (2019).
Shukla, A. K., Singh, P. & Vardhan, M. A new hybrid wrapper TLBO and SA with SVM approach for gene expression data. Inf. Sci. 503, 238–254 (2019).
Jiang, S., Chin, K.-S., Wang, L., Qu, G. & Tsui, K. L. Modified genetic algorithm-based feature selection combined with pre-trained deep neural network for demand forecasting in outpatient department. Expert Syst. Appl. 82, 216–230 (2017).
Ruggieri, S. Complete search for feature selection in decision trees. J. Mach. Learn. Res. 20, 1–34 (2019).
Pashaei, E., Pashaei, E. & Aydin, N. Gene selection using hybrid binary black hole algorithm and modified binary particle swarm optimization. Genomics 111, 669–686 (2019).
Ali, W. & Ahmed, A. A. Hybrid intelligent phishing website prediction using deep neural networks with genetic algorithm-based feature selection and weighting. IET Inf. Secur. 13, 659–669 (2019).
Sprenger, H. et al. Metabolite and transcript markers for the prediction of potato drought tolerance. Plant Biotechnol. J. 16, 939–950 (2018).
Mafarja, M. & Mirjalili, S. Whale optimization approaches for wrapper feature selection. Appl. Soft Comput. 62, 441–453 (2018).
Masoudi-Sobhanzadeh, Y. & Masoudi-Nejad, A. Synthetic repurposing of drugs against hypertension: a datamining method based on association rules and a novel discrete algorithm. BMC Bioinf. 21, 1–21 (2020).
Faramarzi, A., Heidarinejad, M., Stephens, B. & Mirjalili, S. Equilibrium optimizer: A novel optimization algorithm. Knowl.-Based Syst. 191, 105190 (2020).
Katuwal, R., Suganthan, P. N. & Zhang, L. An ensemble of decision trees with random vector functional link networks for multi-class classification. Appl. Soft Comput. 70, 1146–1153 (2018).
Lopes, M. B. et al. Ensemble outlier detection and gene selection in triple-negative breast cancer data. BMC Bioinf. 19, 1–15 (2018).
Dimitriadis, S. I., Liparas, D., Tsolaki, M. N. & Alzheimer's Disease Neuroimaging Initiative. Random forest feature selection, fusion and ensemble strategy: combining multiple morphological MRI measures to discriminate among healthy elderly, MCI, cMCI and Alzheimer's disease patients: from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database. J. Neurosci. Methods 302, 14–23 (2018).
MotieGhader, H., Gharaghani, S., Masoudi-Sobhanzadeh, Y. & Masoudi-Nejad, A. Sequential and mixed genetic algorithm and learning automata (SGALA, MGALA) for feature selection in QSAR. IJPR 16, 533 (2017).
Khan, M. A. et al. An optimized method for segmentation and classification of apple diseases based on strong correlation and genetic algorithm based feature selection. IEEE Access 7, 46261–46277 (2019).
Xue, X., Li, C., Cao, S., Sun, J. & Liu, L. Fault diagnosis of rolling element bearings with a two-step scheme based on permutation entropy and random forests. Entropy 21, 96 (2019).
Wang, M. & Barbu, A. Are screening methods useful in feature selection? An empirical study. PloS ONE 14, e0220842 (2019).
Corrales, D. C., Lasso, E., Ledezma, A. & Corrales, J. C. Feature selection for classification tasks: Expert knowledge or traditional methods?. J. Intell. Fuzzy Syst. 34, 2825–2835 (2018).
Urbanowicz, R. J., Meeker, M., La Cava, W., Olson, R. S. & Moore, J. H. Relief-based feature selection: Introduction and review. J. Biomed. Inform. 85, 189–203 (2018).
Brahim, A. B. & Limam, M. Ensemble feature selection for high dimensional data: a new method and a comparative study. Adv. Data Anal. Classif. 12, 937–952 (2018).
Wah, Y. B., Ibrahim, N., Hamid, H. A., Abdul-Rahman, S. & Fong, S. Feature selection methods: case of filter and wrapper approaches for maximising classification accuracy. Pertanika J. Sci. Technol. 26 (2018).
Jović, A., Brkić, K. & Bogunović, N. In 2015 38th International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO) 1200–1205 (IEEE).
Asuncion, A. & Newman, D. UCI Machine Learning Repository (Univ. of California, Irvine, CA, USA, 2007).
Haghjoo, N. & Masoudi-Nejad, A. Introducing a panel for early detection of lung adenocarcinoma by using data integration of genomics, epigenomics, transcriptomics and proteomics. Exp. Mol. Pathol. 112, 104360 (2020).
Bulaghi, Z. A., Navin, A. H., Hosseinzadeh, M. & Rezaee, A. World competitive contest-based artificial neural network: A new class-specific method for classification of clinical and biological datasets. Genomics (2020).
Frank, A. UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science. http://archive.ics.uci.edu/ml (2010).
Grisoni, F., Consonni, V. & Ballabio, D. Machine learning consensus to predict the binding to the androgen receptor within the CoMPARA project. J. Chem. Inf. Model. 59, 1839–1848 (2019).
Guyon, I., Gunn, S. R., Ben-Hur, A. & Dror, G. In NIPS 545–552.
Mahe, P. et al. Automatic identification of mixed bacterial species fingerprints in a MALDI-TOF mass-spectrum. Bioinformatics 30, 1280–1286 (2014).
Weinstein, J. N. et al. The cancer genome atlas pan-cancer analysis project. Nat. Genet. 45, 1113–1120 (2013).
Shi, Y. & Eberhart, R. C. In Proceedings of the 1999 Congress on Evolutionary Computation-CEC99 (Cat. No. 99TH8406) 1945–1950 (IEEE).
Azad-Farsani, E., Zare, M., Azizipanah-Abarghooee, R. & Askarian-Abyaneh, H. A new hybrid CPSO-TLBO optimization algorithm for distribution network reconfiguration. J. Intell. Fuzzy Syst. 26, 2175–2184 (2014).
Ghaemi, M. & Feizi-Derakhshi, M.-R. Forest optimization algorithm. Expert Syst. Appl. 41, 6676–6687 (2014).
Dong, H., Li, T., Ding, R. & Sun, J. A novel hybrid genetic algorithm with granular information for feature selection and optimization. Appl. Soft Comput. 65, 33–46 (2018).
Liu, X.-Y., Liang, Y., Wang, S., Yang, Z.-Y. & Ye, H.-S. A hybrid genetic algorithm with wrapper-embedded approaches for feature selection. IEEE Access 6, 22863–22874 (2018).
Chang, C.-C. & Lin, C.-J. LIBSVM: a library for support vector machines. ACM Trans. Intell. Syst. Technol. 2, 1–27 (2011).
Ge, R. et al. McTwo: a two-step feature selection algorithm based on maximal information coefficient. BMC Bioinf. 17, 1–14 (2016).
Xue, X., Yao, M. & Wu, Z. A novel ensemble-based wrapper method for feature selection using extreme learning machine and genetic algorithm. Knowl. Inf. Syst. 57, 389–412 (2018).
Nahato, K. B., Nehemiah, K. H. & Kannan, A. Hybrid approach using fuzzy sets and extreme learning machine for classifying clinical datasets. Inf. Med. Unlocked 2, 1–11 (2016).
Mafarja, M. M. & Mirjalili, S. Hybrid whale optimization algorithm with simulated annealing for feature selection. Neurocomputing 260, 302–312 (2017).
Mirjalili, S. et al. Salp Swarm Algorithm: a bio-inspired optimizer for engineering design problems. Adv. Eng. Softw. 114, 163–191 (2017).
Karaboğa, D. & Ökdem, S. A simple and global optimization algorithm for engineering problems: differential evolution algorithm. Turk. J. Electr. Eng. Comput. Sci. 12, 53–60 (2004).
Mundra, P. A. & Rajapakse, J. C. SVM-RFE with MRMR filter for gene selection. IEEE Trans. Nanobiosci. 9, 31–37 (2009).
Duan, K.-B., Rajapakse, J. C., Wang, H. & Azuaje, F. Multiple SVM-RFE for gene selection in cancer classification with expression data. IEEE Trans. Nanobiosci. 4, 228–234 (2005).
Acknowledgements
The authors would like to thank the Iran National Science Foundation (INSF) for its support.
Author information
Authors and Affiliations
Contributions
Y.M.-S.: conceptualization, implementation, formal analysis, investigation, writing, editing, and revising the manuscript. H.M.: validation, data analysis, editing the manuscript. Y.O.: conceptualization, results analysis, validation, writing, editing, and revising the manuscript. A.M.-N.: conceptualization, supervision, project administration, writing, editing, and revising the manuscript. All authors have read and approved the manuscript.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher's note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Masoudi-Sobhanzadeh, Y., Motieghader, H., Omidi, Y. et al. A machine learning method based on the genetic and world competitive contests algorithms for selecting genes or features in biological applications. Sci Rep 11, 3349 (2021). https://doi.org/10.1038/s41598-021-82796-y