Abstract
Gene/feature selection is an essential preprocessing step for building models with machine learning techniques. It also plays a critical role in various biological applications, such as the identification of biomarkers. Although many feature/gene selection algorithms and methods have been introduced, they may suffer from problems such as parameter tuning or a low level of performance. To tackle such limitations, this study introduces a universal wrapper approach based on our previously introduced optimization algorithm and the genetic algorithm (GA). In the proposed approach, candidate solutions have variable lengths, and a support vector machine scores them. To show the usefulness of the method, thirteen classification- and regression-based datasets with different properties were chosen from various biological scopes, including drug discovery, cancer diagnostics, and clinical applications. Our findings confirm that the proposed method outperforms most of the other currently used approaches and also frees users from the difficulties of tuning various parameters. As a result, users may optimize their biological applications, such as obtaining a biomarker diagnostic kit with the minimum number of genes and maximum separability power.
Introduction
In computational biology, researchers may be involved in handling large omics datasets with many features (e.g., genomics, proteomics, metabolomics, etc.)1. For instance, the total number of profiled genes is usually more than 20,000 in human samples, which have been exploited for different purposes such as the detection of biomarkers2. Given that the number of features from proteomics and metabolomics data is potentially much larger3, it is almost impossible to extract a biomarker kit of manageable size from such large datasets4. For instance, in the field of genomic data, researchers aim to (i) select genes having higher separability power between different states, such as cancerous and noncancerous samples, and (ii) confine them to a number that can reasonably be handled5. From the machine learning perspective, features or genes can be categorized into three classes as follows:
(i) Negative features6, which can mislead a learner and reduce its performance. Thus, they must not be selected in the application.
(ii) Neutral features7, which do not play any role in the performance of a learner and only increase prediction time. Like the first group, these features should be avoided.
(iii) Positive features8, which play a determinant role in distinguishing between samples and enhance the performance of a learner. Even for such features, feature selection (FS) methods need to be applied, since some features may be redundant with others, and a large set of them may be represented by a small set.
Due to its combinatorial nature, FS is a nondeterministic polynomial-time hard (NP-hard) problem: the number of possible feature subsets grows exponentially with the number of features, so no known algorithm solves it in polynomial time9. To overcome this time complexity, heuristic and metaheuristic algorithms, which find acceptable answers to such problems, have been developed10.
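The combinatorial blow-up above is easy to quantify: a dataset with n features admits 2^n − 1 non-empty feature subsets, so exhaustive search is hopeless even for modest n. A quick illustrative check (not part of the paper's implementation):

```python
def subset_count(n):
    """Number of non-empty feature subsets of an n-feature dataset."""
    return 2 ** n - 1

# A 10-feature toy dataset already has over a thousand subsets;
# ~20,000 profiled genes give a count with thousands of decimal digits.
small = subset_count(10)             # 1023
digits = len(str(subset_count(20000)))
```

This is why the heuristic and metaheuristic algorithms discussed next trade guaranteed optimality for tractable run times.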
In different studies, it has been shown that the metaheuristic algorithms, which do not confine themselves to a specific range of the search space, are generally more suitable than heuristic algorithms11,12,13. In addition, two-step methods may obtain better results than single methods14,15. Therefore, in this study, we capitalized on a two-step method, which is based on a genetic algorithm (GA)16 and our previously developed world competitive contests (WCC) optimization algorithm17, the so-called “GA_WCC method”. In the first step of the GA_WCC method, the GA reduces the total number of features to a minimum upper bound. Next, the WCC selects an optimal subset of features for the desired application. Overall, the GA_WCC method is based on a two-step process for FS, which (i) does not require limiting the number of features to a predefined value, and (ii) outperforms other currently used methods.
Related works
In this section, we discuss the limitations of related works, which can be divided into six classes as follows:
(i) Filter methods: These techniques look for relationships among features and investigate how much information a feature contains. For this purpose, various mathematical formulas have been proposed, including entropy18, mutual information19, the Fisher score20, correlation21, the Laplacian score22, etc. Although these approaches are simple and have low time complexity, their performance is lower than that of the other categories23. To tackle this limitation, wrapper-based methods have been developed and are built upon in this paper.
(ii) Wrapper methods: Unlike the first class, these approaches score the selected features with a learner such as a support vector machine (SVM)24, artificial neural network (ANN)25, decision tree (DT)26, or others27,28,29. Usually, optimization algorithms are applied to select an optimal subset of features30,31. Different studies have shown that these approaches can achieve remarkable outcomes32, but most FS studies do not employ state-of-the-art algorithms. Here, we apply the WCC algorithm to the FS problem.
(iii) Ensemble methods: For FS, ensemble methods create a learner such as a decision tree33 and select the features that the learner chooses for generating a model34,35. Due to their greedy nature, ensemble methods may fall into local optima and fail to reach the optimal result. To deal with this limitation, we introduce the WCC algorithm, which has a low probability of falling into local optima.
(iv) Hybrid methods: A combination of the three mentioned methods is applied to the FS problem36. For example, the total number of features is first reduced by filter methods, and then an optimal subset of features is chosen by wrapper or ensemble methods37,38. In this class of related works, it is essential to combine the algorithms properly. Therefore, we assumed that a wrapper-wrapper approach, which merges two wrapper-based algorithms, might be a suitable option for FS.
(v) Hypothesis-based studies: A concept is hypothesized based on prior knowledge, and its correctness is tested via various experiments on gold-standard datasets39. Although these techniques can help in making a proper decision, they do not overcome the aforementioned limitations.
(vi) Review works: These works survey different methods, such as filter40, wrapper41, ensemble42, and hybrid43 approaches, and discuss their advantages and disadvantages. Further, they study the role of FS in diverse areas and often outline future directions44.
Materials and methods
The datasets
Several datasets with diverse properties were selected from various sources, such as the machine learning repository developed at the University of California Irvine (UCI)45 and published seminal literature. For every dataset, the total number of samples is almost the same across its different classes. Table 1 shows the properties of the datasets and describes them.
The proposed method
Our proposed GA_WCC method (Fig. 1) selects the features using a two-step wrapper approach. To this end, as the first step, the Genetic Algorithm (GA) limits the total number of genes or, generally, features, and then the World Competitive Contests (WCC) selects an optimal subset of them from the reduced set of features. Overall, this study has been established based on the following rationale:
(i) The GA starts with a first population of candidate solutions, each of which consists of several variables (a subset of features). Unlike other optimization algorithms, such as particle swarm optimization (PSO)53, the GA has a minimal probability of falling into local optima because it produces a high number of candidate sets. However, the convergence speed of the GA is usually lower than that of other optimization algorithms (e.g., TLBO54 and FOA55). This limitation may be addressed by combining the GA with other state-of-the-art optimization algorithms, which the present study does by merging the GA with the WCC algorithm.
(ii) The WCC begins with a first population of potential answers and applies all of its operators to all the existing candidate solutions (CSs), so it spends more time than other optimization algorithms. Hence, when applying the WCC algorithm to an optimization problem, the total number of CSs is limited. The algorithm has an acceptable convergence speed, but its main limitation relates to its complex stages, which increase the execution time. Further, for a given CS, the WCC calls the cost function more often than other algorithms due to the nature of its operators. In the last steps of the algorithm, the applied operators make the CSs similar to each other, so the convergence speed of the algorithm is reduced (due to the limited number of CSs).
Optimization algorithms differ from each other in the way they change CSs (i.e., in their operators). In this study, the WCC algorithm is adapted to the FS problem, and its operators are modified to select an optimal subset of features. Given the advantages and disadvantages of the GA and the modified WCC algorithm, it is expected that their limitations will be diminished when they are combined. Inspired by this idea, this study was designed, and an efficient two-step feature selection method based on a wrapper approach was introduced. As shown in Fig. 1, the GA_WCC method includes the following steps:
(i) Applying the genetic algorithm: In the first step of the proposed method, a version of the GA is used for FS56. In many FS studies, CSs are binary vectors whose length is constant and equal to the total number of features. In this study, for both the GA and WCC algorithms, CSs have variable sizes and contain the indices of the selected features. In the optimization scope, the GA is the basis for many other optimization algorithms; however, it generally exhibits a lower level of performance in comparison with them. This notwithstanding, the GA produces diverse CSs, which may help other optimization algorithms obtain better results57. Figure 2 shows the flowchart of the employed GA, which includes the following main steps:
(a) Creating a first population of CSs: potential answers, or CSs, are called 'chromosomes' in the GA, and the values of their genes are assigned randomly. Every CS incorporates some features chosen from the given feature set (the total number of variables in a CS depends on the size of the dataset). In the proposed method, the CSs initially have an identical length, but their lengths may come to differ because of repeated values. For instance, when generating the initial CSs, a CS may contain some repeated features; in such a case, only one of the repeated values is retained and the others are discarded.
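As a concrete illustration of step (a), the following Python sketch builds variable-length CSs of feature indices and collapses repeated values. This is our own illustration under stated assumptions (uniform sampling of 1-based indices); the paper's implementation uses MATLAB and the FeatureSelect tool.

```python
import random

def init_population(pop_size, cs_length, n_features, seed=0):
    """Create an initial population of candidate solutions (CSs).

    Each CS is a list of 1-based feature indices. Repeated indices
    within a CS are collapsed to a single occurrence, so CS lengths
    may differ after initialization, as described in the paper.
    """
    rng = random.Random(seed)
    population = []
    for _ in range(pop_size):
        cs = [rng.randint(1, n_features) for _ in range(cs_length)]
        # keep only the first occurrence of each repeated feature
        population.append(list(dict.fromkeys(cs)))
    return population
```

`dict.fromkeys` preserves insertion order, so the surviving copy of a repeated feature is always the first one encountered.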
(b) Applying GA operators: The GA consists of three main operators, named mutation, crossover, and selection. In the employed mutation operator, a variable of a chromosome is randomly selected, and its value is replaced by another randomly selected value. In the crossover operator, two ranges of the CSs with the same length are randomly chosen, and their contents are exchanged. Finally, the selection operator uses an elitism technique, which forms the new population from the fittest chromosomes of the current population. Figures 3 and 4 depict instances of the mutation and crossover operators and describe how they are applied to generate new CSs.
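The mutation and crossover operators described above can be sketched as follows. This is a hedged illustration: the function names, the random-number handling, and the post-operator de-duplication order are our own assumptions, not code from the paper.

```python
import random

def mutate(cs, n_features, rng):
    """Replace one randomly chosen variable with a random feature index,
    then drop any repeats (as the paper's variable-length CSs require)."""
    child = cs[:]
    pos = rng.randrange(len(child))
    child[pos] = rng.randint(1, n_features)
    return list(dict.fromkeys(child))

def crossover(cs_a, cs_b, rng):
    """Swap two randomly placed, equal-length ranges between two CSs."""
    a, b = cs_a[:], cs_b[:]
    seg = rng.randint(1, min(len(a), len(b)))       # shared range length
    start_a = rng.randrange(len(a) - seg + 1)
    start_b = rng.randrange(len(b) - seg + 1)
    a[start_a:start_a + seg], b[start_b:start_b + seg] = \
        b[start_b:start_b + seg], a[start_a:start_a + seg]
    return list(dict.fromkeys(a)), list(dict.fromkeys(b))
```

Because repeats are removed after each operator, offspring may be shorter than their parents, which is how CS lengths drift during the search.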
(c) Scoring the selected features: The proposed method is a wrapper method in which a learner evaluates the selected features. Because the datasets are approximately class-balanced, we primarily use the accuracy score (Eq. 1). Other criteria are also inspected in the experimental section.
$$Score = Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$$

(1)

where TP, TN, FP, and FN represent true positives, true negatives, false positives, and false negatives, respectively. Because of their performance and reasonable model-building time, support vector machines (SVMs) were used for assessing the CSs. Given the popularity and performance of SVMs, many libraries and packages have been developed for them. In this study, the LibSVM library, one of the most popular libraries offering various options, was employed58.
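Eq. (1) translates directly into code. The sketch below only computes the score from confusion-matrix counts; the SVM training itself (LibSVM in the paper) is deliberately out of scope here.

```python
def accuracy_score_eq1(tp, tn, fp, fn):
    """Eq. (1): accuracy = (TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)

# Example confusion-matrix counts (illustrative numbers only):
score = accuracy_score_eq1(tp=40, tn=45, fp=10, fn=5)   # 0.85
```

In the wrapper loop, this score is computed on the cross-validated predictions of the SVM trained on the features listed in each CS.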
(d) Investigating the termination condition: when the value of the best CS remains constant for 10 consecutive iterations (generations), the GA is terminated, and all of its CSs are passed to the WCC algorithm.
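The stall-based termination test can be expressed as a small helper; this exact routine is our illustration, not published code from the paper.

```python
def ga_should_stop(best_history, patience=10):
    """Stop the GA once the best-CS score has stayed constant for
    `patience` consecutive generations (10 in the paper)."""
    if len(best_history) < patience:
        return False
    last = best_history[-patience:]
    return all(v == last[0] for v in last)
```

`best_history` is simply the best accuracy recorded at the end of each generation.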
(ii) Applying the proposed algorithm (the WCC): As mentioned before, at the end of the first step, the GA passes the created CSs to the proposed algorithm (the flowchart of the WCC algorithm is shown in Fig. 5), and these constitute its first population of CSs. Next, the WCC changes the CSs using its operators, which are explained and formulated as follows:
(a) Attacking operator: For a given CS, this operator selects some variables randomly and assigns them new values chosen by chance from [1, n], where n is the total number of existing features/genes. Equation 2 formulates the attacking operator:
$$\mathop \sum \limits_{i = 1}^{k} \left[ {CS\left( r \right) = rand\left( n \right)} \right]$$

(2)

where CS, n, and k are a given candidate solution, the total number of features, and a random integer between 1 and n, respectively. In other words, the parameter k determines how many variables of a CS must be changed. Further, the sigma sign denotes a loop, and r, like k, is a random integer between 1 and n. Figure 6 shows an example of the attacking operator.
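A possible Python reading of Eq. (2) is shown below. Note one assumption on our part: we bound both k and the changed positions by the CS length so every index stays valid, whereas the formula states the bounds in terms of n.

```python
import random

def attack(cs, n_features, rng):
    """Attacking operator (Eq. 2): pick k positions at random and
    assign each a fresh random feature index from [1, n_features]."""
    child = cs[:]
    k = rng.randint(1, len(child))        # how many variables to change
    for _ in range(k):                    # the sigma sign of Eq. 2
        r = rng.randrange(len(child))
        child[r] = rng.randint(1, n_features)
    return list(dict.fromkeys(child))     # collapse repeated features
```

Because many positions can change at once, this operator provides the large, exploratory jumps of the WCC search.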
(b) Transferring operator: Based on the scores (classification accuracy using a given CS), this operator selects several CSs with the highest scores (Selected_CS) and then randomly chooses some values (features) from them. Next, for a given CS, this operator imports the selected values. Equation 3 formulates the mentioned steps, and Fig. 7 describes the transferring operator in detail.
$$\mathop \sum \limits_{j = 1}^{R} \mathop \sum \limits_{i = 1}^{k} \left[ {CS\left( r \right) = Selected\_CS_{m} \left( {rand\left( l \right)} \right)} \right]$$

(3)

where \(l\), R, and m are the length of the Selected_CS, a random integer between 1 and the total number of selected CSs, and the index of the randomly chosen Selected_CS, respectively. The other parameters are as described for Eq. 2.
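Eq. (3) can be sketched similarly. Here `selected_css` stands for the highest-scoring CSs, and the loop bounds R and k are drawn at random as in the formula; the exact sampling scheme is our assumption.

```python
import random

def transfer(cs, selected_css, rng):
    """Transferring operator (Eq. 3): copy randomly chosen feature
    values from the best-scoring CSs (selected_css) into `cs`."""
    child = cs[:]
    k = rng.randint(1, len(child))            # inner-loop bound of Eq. 3
    R = rng.randint(1, len(selected_css))     # outer-loop bound of Eq. 3
    for _ in range(R):
        # index m: which high-scoring donor CS to copy from
        donor = selected_css[rng.randrange(len(selected_css))]
        for _ in range(k):
            r = rng.randrange(len(child))
            child[r] = donor[rng.randrange(len(donor))]  # rand(l)
    return list(dict.fromkeys(child))
```

This is the exploitation step: promising feature indices spread from the best CSs into the rest of the population.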
(c) Passing operator: While the transferring and attacking operators may result in large changes in a CS, this operator guarantees small modifications. For this purpose, the operator selects only one variable by chance and changes its value. Equation 4, whose parameters are explained under Eq. 2, formulates the passing operator.
$$CS\left( r \right) = rand\left( n \right)$$

(4)

Figure 8 illustrates an example of the passing operator and explains how it can be applied to the FS problem.
Each change induced by the operators is accepted only if it increases the accuracy score. Further, repeated features may appear after applying the operators; in these situations, only one of the repeated features is kept and the others are removed. Hence, the length of CSs may vary.
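The passing operator (Eq. 4) and the accept-only-if-better rule can be combined in a few lines; again, this is an illustrative sketch rather than the authors' code, and `score_fn` stands for the SVM-based accuracy evaluation.

```python
import random

def pass_op(cs, n_features, rng):
    """Passing operator (Eq. 4): change exactly one variable."""
    child = cs[:]
    child[rng.randrange(len(child))] = rng.randint(1, n_features)
    return list(dict.fromkeys(child))

def apply_if_better(cs, operator, score_fn):
    """A change is kept only if it raises the accuracy score;
    otherwise the original CS survives unchanged."""
    candidate = operator(cs)
    return candidate if score_fn(candidate) > score_fn(cs) else cs
```

The same `apply_if_better` guard would wrap the attacking and transferring operators, so the population's best score can only improve monotonically.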
(d) Investigating the termination conditions: Several options (e.g., a predefined number of iterations, elapsed time, accuracy, etc.) can be used to terminate the algorithms. In the present study, two different strategies were chosen. As mentioned before, the GA finishes when the accuracy value remains approximately constant over the last ten iterations. For the WCC algorithm, a predetermined number of iterations is used as the termination condition.
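Putting the two termination strategies together, the overall GA_WCC control flow can be sketched as below; `ga_step` and `wcc_step` stand for one generation of the respective algorithm and are placeholders, as is `score_fn`.

```python
def ga_wcc(score_fn, ga_step, wcc_step, pop,
           ga_patience=10, wcc_iters=100):
    """Two-step GA_WCC driver sketch: run the GA until its best score
    stalls for `ga_patience` generations, then hand the whole GA
    population to the WCC for a fixed number of iterations."""
    history = []
    while True:
        pop = ga_step(pop)                    # mutation/crossover/selection
        history.append(max(score_fn(cs) for cs in pop))
        if (len(history) >= ga_patience and
                all(v == history[-1] for v in history[-ga_patience:])):
            break                             # GA stalled: stop step one
    for _ in range(wcc_iters):                # fixed-iteration termination
        pop = wcc_step(pop)                   # attack/transfer/pass
    return max(pop, key=score_fn)             # best feature subset found
```

With stub steps that leave the population unchanged, the driver degenerates to returning the best initial CS, which makes the control flow easy to unit-test.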
Results
To obtain the results, a computer system with a dual-core 2.2 GHz processor and 12 GB of RAM was employed. Further, our previously designed FeatureSelect software and the MATLAB programming language were used for the implementations. In this section, all reported outcomes refer to results from the five-fold cross-validation technique. For comparing the algorithms and methods, the same conditions were considered; for example, the GA, WCC algorithm, and GA_WCC method were allowed to run for an identical time. The population sizes for the GA, WCC algorithm, and GA_WCC method were determined by trial and error based on their run times, choosing the values at which the best performance of the algorithms was observed. Based on the outcomes, the population sizes were set to 100, 20, and 100 for the GA, WCC algorithm, and GA_WCC method, respectively. The mutation and crossover rates were set to 30%, because the GA shows suitable behavior with these values. In addition to the population-size parameters, the WCC algorithm has a match-time parameter (the total number of attempts to change a CS), which was set to 2; for the GA_WCC method, this parameter was set to 1. The outcomes, which encompass the results of five popular filter FS methods, the GA, the WCC, a two-step filter-wrapper method (EN_WCC), and the proposed wrapper-wrapper method (GA_WCC), were divided into the following three categories:
(i) The first category of results: This class consists of the results obtained from applying the mentioned algorithms and methods to the classification-type datasets having more than 50 features. Tables 2 and 3 represent the attained outcomes. Also, Fig. 9 depicts the results of the SVM without applying the FS algorithms to the investigated datasets.
Wrapper-based FS methods improve the performance of the SVM, whereas filter-based FS approaches may reduce it. Overall, among the filter methods, the entropy-based (EN) FS method led to more appropriate results than the others. Moreover, between the GA and WCC algorithms, the WCC yields better outcomes. Hence, a combination of EN and WCC (the so-called EN_WCC) was also investigated and compared against the others. For the Cancer dataset, GA_WCC, GA, and WCC yielded the best solutions; however, GA_WCC and GA classify the data with six features, whereas WCC requires ten attributes. For the Arrhythmia dataset, the proposed approach outperforms the others in terms of the total number of features (NOF) and the other classification criteria. For the Diabetes dataset, EN_WCC selected the minimum number of features and yielded better outcomes than the filter methods, as also observed for the Cancer dataset; nevertheless, the results of GA_WCC, WCC, and GA surpass those of EN_WCC. Similar outcomes are observed for the other datasets. Tables 2 and 3 show that the wrapper and two-step methods are more efficient than the filter ones, and their performance can be ranked as GA_WCC, WCC, GA, and EN_WCC, respectively.
For further evaluation of the methods, receiver operating characteristic (ROC) curves are shown in Figs. 10 and 11. The area under the curve (AUC) values of the approaches on the datasets of the first class of outcomes are listed in Table 4. The two-step and wrapper approaches have remarkable functionality compared to the others, and the proposed method outperforms all of them (Figs. 10, 11, Tables 2, 3, and 4). In another evaluation of the algorithms' performance, the p-value (PV) measure was considered (Table 5). To this end, every algorithm was run in 50 individual executions, and the results of the proposed method (GA_WCC) were considered the test base; the outcomes of the other algorithms were then compared with them. Except for the Cancer dataset, on which the effectiveness of the algorithms is the same, the proposed method outperformed the others on the remaining datasets. Figure 12 also presents boxplots of the algorithms' outputs, evaluated using a one-way ANOVA test. Every execution consists of 100 iterations of the algorithm's steps. At the end of each iteration, the best acquired accuracy was stored, and the convergence behavior of the algorithms was investigated for the datasets including more than 1000 features (Fig. 13). It was observed that the convergence speed of the proposed method is higher than that of the GA and WCC algorithms run individually. As mentioned before, the combined method can efficiently address the limitations of the GA and the WCC algorithm (the low convergence speed of the GA and the restricted number of CSs in the WCC) and yields better outcomes than either algorithm run alone.
In filter FS methods, determining the total number of features is a challenging problem and plays an essential role in the performance of a model. The results of the five filter approaches are shown in Figs. 14, 15, 16, and 17. These outcomes show the performance of the filter FS methods with different numbers of features.
(ii) The second category of results: This section includes the results of the algorithms on the datasets having fewer than 50 features/attributes. The main goal of this section is to check the effect of FS methods on datasets that consist of fewer features. For small datasets, single wrapper methods do not face special challenges in FS; indeed, they may obtain the best solution while improving the run time. Hence, in this section, the functionality of the GA and WCC algorithms is inspected. As in the first part, criteria such as sensitivity, specificity, accuracy, precision, and AUC were investigated. The acquired data are listed in Table 6.
Without applying the GA and WCC algorithms, the SVM alone yields accuracy values of 0.5263, 0.6645, and 0.5812 using the five-fold cross-validation technique on the CHD, SHD, and PID datasets, respectively. By applying the algorithms, the accuracy improved for the CHD and SHD datasets and remained unchanged for the PID dataset. Further, the total number of features was remarkably reduced; thus, the models obtained by applying the algorithms operate faster than the model that uses all the existing features. Comparing the GA and WCC algorithms, the WCC led to a model with a lower number of features and higher criterion values. Therefore, it might be concluded that a state-of-the-art optimization algorithm can obtain more acceptable results than others.
(iii) The third category of results: In this section, the outcomes of the methods and algorithms are evaluated on the regression-based datasets (WDBC and drug datasets). To this end, criteria such as the root mean squared error (RMSE) and the correlation between predicted and real labels were calculated (Table 7). For the filter FS methods, different numbers of features were tested, and their best results were reported. For the wrapper FS approaches, it is not necessary to limit the total number of features, as they can regulate it themselves. Even so, they produce variable results across executions, so they must be run repeatedly; here, their best outcomes among 50 individual executions were reported as solutions to the problem, along with several statistics computed over those executions, including the confidence interval (CI), p-value, and standard deviation (STD).
From the run-time perspective, filter FS methods require less time than wrapper approaches but do not deliver improved outcomes. For instance, for the WDBC dataset, the entropy FS approach yields the minimum error and the maximum correlation between predicted and real labels when the total number of features is limited to 13. The value of correlation can be calculated not only for the entropy method but also for the others. Like the first class of results, this class also shows the remarkable performance of the proposed approach (GA_WCC) in terms of error, correlation, the total number of selected features, run time, etc. Besides, the WCC and GA results show that wrapper FS methods may achieve better results than filter FS approaches. In Fig. 18, the scatter plots of the proposed method on the regression-based datasets are shown.
Discussion
Many methods and algorithms have been proposed for selecting an optimal subset of features, which is indeed an NP-hard problem, particularly in machine learning with a biological context. Besides enhancing the separability power of a model, optimal features improve the speed of a model and may lead to valuable results such as acquiring an optimal kit of biomarkers to be used in applications. In this area, it has been shown that two-step FS approaches lead to better outcomes than single methods59, and wrapper-based FS methods usually outperform filter and embedded FS techniques60. The results of this study also confirm the mentioned observations and allow for the following important key conclusions:
First, wrapper FS methods may obtain an optimal subset of features without requiring the total number of features to be confined to a predefined value. Nevertheless, there are some restrictions in determining the total number of selected features. For example, wrapper methods may obtain a subset of attributes with the highest score while the total number of selected features is greater than the number the problem allows (problem limitations). Even so, we believe that wrapper FS methods are still better than the filter and embedded FS approaches, in large part because they can be formulated in a way that resolves the problem constraints.
Second, limiting filter methods to a predefined number of features is a challenging problem and affects their performance. The results of this work show that the performance of filter FS approaches varies with the number of selected features; thus, this parameter remains a challenge for researchers. However, wrapper methods, which consider a set of features instead of examining each of them separately, do not face this restriction.
Third, FS is also essential for datasets having a low number of features. In the second part of the results, the performance of wrapper FS methods was investigated on some gold-standard datasets whose total number of features is less than 50. Based on other conducted studies61, it seems that FS has been ignored in these works even though it may improve performance. For this class of datasets, considering the total number of features, single wrapper methods might be a proper choice.
Fourth, wrapper-wrapper FS methods may be the best option for selecting an optimal subset of features. In the last decade, different types of hybrid methods have been introduced for the FS problem owing to their promising results. However, most of them combine filter-filter or filter-wrapper approaches, and a suitable configuration of wrapper-wrapper methods has been overlooked. In the present investigation, a wrapper-wrapper approach based on the GA and the proposed WCC algorithm was introduced, which resulted in superior outcomes compared to the other approaches. The WCC algorithm starts with a first population of CSs and then applies its operators to them in order to obtain a better solution to the FS problem. The main difference between the WCC algorithm and other optimization algorithms lies in the steps of the algorithm and its operators. The two-step approaches differ from hybrid methods that merge optimization algorithms, such as the whale optimization algorithm and simulated annealing62. In this study, to obtain an efficient combination of the algorithms, the advantages and limitations of the GA and WCC algorithm were considered. Since the GA produces various CSs, the WCC algorithm confines them to a limited number. Unlike the WCC algorithm, the GA may suffer from a low convergence speed and not show suitable performance relative to other optimization algorithms. For these reasons, the GA and WCC algorithm were combined, and the results showed that their combination yields better outcomes.
Fifth, the performance of algorithms and methods varies across datasets. Every algorithm or method has its own attitude toward the FS problem, so their functionality may differ on various data. Generally, it is impossible to predict a priori which of the methods or algorithms is suitable for a given problem. Nonetheless, wrapper-wrapper FS approaches appear promising for producing the desired results. As future work, the proposed method can be applied to other algorithms, such as the Salp Swarm Algorithm63 and DE64, while considering their limitations and disadvantages. Also, the proposed method scores a set of features but does not rank the features of the obtained set. To address this limitation, the proposed approach can be combined with state-of-the-art ranking techniques such as SVM-RFE65,66.
Conclusion
For selecting an optimal subset of features, a two-step wrapper-wrapper FS method based on the GA and our proposed algorithm (WCC) was introduced and applied to thirteen biological datasets with different properties. In comparison with other approaches, it can be concluded that two-step techniques may lead to better results than single-step methods. Furthermore, among the two-step approaches, wrapper-wrapper FS methods may be more appropriate than others. For biological applications, wrapper approaches seem to be the most convenient and reliable methods, in large part because they do not need to be restricted to a predefined number of features. Taken together, based on our findings, wrapper-wrapper FS methods can be used to optimize FS problems and yield robust and desired outcomes.
References
Ghosh, M., Begum, S., Sarkar, R., Chakraborty, D. & Maulik, U. Recursive memetic algorithm for gene selection in microarray data. Expert Syst. Appl. 116, 172–185 (2019).
Barnabas, G. D. et al. Microvesicle proteomic profiling of uterine liquid biopsy for ovarian cancer early detection. Mol. Cell. Proteomics 18, 865–875 (2019).
Walther, D., Strassburg, K., Durek, P. & Kopka, J. Metabolic pathway relationships revealed by an integrative analysis of the transcriptional and metabolic temperature stress-response dynamics in yeast. Omics J. Integr. Biol. 14, 261–274 (2010).
Frankell, A. M. et al. The landscape of selection in 551 esophageal adenocarcinomas defines genomic biomarkers for the clinic. Nat. Genet. 51, 506–516 (2019).
Long, N. P. et al. Efficacy of integrating a novel 16-gene biomarker panel and intelligence classifiers for differential diagnosis of rheumatoid arthritis and osteoarthritis. J. Clin. Med. 8, 50 (2019).
MotieGhader, H., Masoudi-Sobhanzadeh, Y., Ashtiani, S. H. & Masoudi-Nejad, A. mRNA and microRNA selection for breast cancer molecular subtype stratification using meta-heuristic based algorithms. Genomics 112, 3207–3217 (2020).
Adeli, E., Li, X., Kwon, D., Zhang, Y. & Pohl, K. M. Logistic regression confined by cardinality-constrained sample and feature selection. IEEE Trans. Pattern Anal. Mach. Intell. 42, 1713–1728 (2019).
Salama, M. A. & Hassan, G. A Novel Feature Selection Measure Partnership-Gain. Int. J. Online Biomed. Eng. 15 (2019).
Li, F. et al. Positive-unlabelled learning of glycosylation sites in the human proteome. BMC Bioinf. 20, 1–17 (2019).
Abdel-Basset, M., El-Shahat, D., El-henawy, I., de Albuquerque, V. H. C. & Mirjalili, S. A new fusion of grey wolf optimizer algorithm with a two-phase mutation for feature selection. Expert Syst. Appl. 139, 112824 (2020).
Sayed, G. I., Hassanien, A. E. & Azar, A. T. Feature selection via a novel chaotic crow search algorithm. Neural Comput. Appl. 31, 171–188 (2019).
Masoudi-Sobhanzadeh, Y., Motieghader, H. & Masoudi-Nejad, A. FeatureSelect: a software for feature selection based on machine learning approaches. BMC Bioinf. 20, 170 (2019).
Masoudi-Sobhanzadeh, Y., Omidi, Y., Amanlou, M. & Masoudi-Nejad, A. Trader as a new optimization algorithm predicts drug-target interactions efficiently. Sci. Rep. 9, 1–14 (2019).
Masoudi-Sobhanzadeh, Y., Omidi, Y., Amanlou, M. & Masoudi-Nejad, A. DrugR+: A comprehensive relational database for drug repurposing, combination therapy, and replacement therapy. Comput. Biol. Med. 109, 254–262 (2019).
Rao, H. et al. Feature selection based on artificial bee colony and gradient boosting decision tree. Appl. Soft Comput. 74, 634–642 (2019).
Gronsbell, J., Minnier, J., Yu, S., Liao, K. & Cai, T. Automated feature selection of predictors in electronic medical records data. Biometrics 75, 268–277 (2019).
Masoudi-Sobhanzadeh, Y. & Motieghader, H. World Competitive Contests (WCC) algorithm: A novel intelligent optimization algorithm for biological and non-biological problems. Inf. Med. Unlocked 3, 15–28 (2016).
Mafarja, M. M. & Mirjalili, S. Hybrid binary ant lion optimizer with rough set and approximate entropy reducts for feature selection. Soft. Comput. 23, 6249–6265 (2019).
Rahmaninia, M. & Moradi, P. OSFSMI: online stream feature selection method based on mutual information. Appl. Soft Comput. 68, 733–746 (2018).
Saqlain, S. M. et al. Fisher score and Matthews correlation coefficient-based feature subset selection for heart disease diagnosis using support vector machines. Knowl. Inf. Syst. 58, 139–167 (2019).
Koprinska, I., Rana, M. & Agelidis, V. G. Correlation and instance based feature selection for electricity load forecasting. Knowl.-Based Syst. 82, 29–40 (2015).
Si, L., Wang, Z., Tan, C. & Liu, X. A feature extraction method based on composite multi-scale permutation entropy and Laplacian score for shearer cutting state recognition. Measurement 145, 84–93 (2019).
Pournoor, E., Elmi, N., Masoudi-Sobhanzadeh, Y. & Masoudi-Nejad, A. Disease global behavior: a systematic study of the human interactome network reveals conserved topological features among categories of diseases. Inf. Med. Unlocked 17, 100249 (2019).
Shukla, A. K., Singh, P. & Vardhan, M. A new hybrid wrapper TLBO and SA with SVM approach for gene expression data. Inf. Sci. 503, 238–254 (2019).
Jiang, S., Chin, K.-S., Wang, L., Qu, G. & Tsui, K. L. Modified genetic algorithm-based feature selection combined with pre-trained deep neural network for demand forecasting in outpatient department. Expert Syst. Appl. 82, 216–230 (2017).
Ruggieri, S. Complete search for feature selection in decision trees. J. Mach. Learn. Res. 20, 1–34 (2019).
Pashaei, E., Pashaei, E. & Aydin, N. Gene selection using hybrid binary black hole algorithm and modified binary particle swarm optimization. Genomics 111, 669–686 (2019).
Ali, W. & Ahmed, A. A. Hybrid intelligent phishing website prediction using deep neural networks with genetic algorithm-based feature selection and weighting. IET Inf. Secur. 13, 659–669 (2019).
Sprenger, H. et al. Metabolite and transcript markers for the prediction of potato drought tolerance. Plant Biotechnol. J. 16, 939–950 (2018).
Mafarja, M. & Mirjalili, S. Whale optimization approaches for wrapper feature selection. Appl. Soft Comput. 62, 441–453 (2018).
Masoudi-Sobhanzadeh, Y. & Masoudi-Nejad, A. Synthetic repurposing of drugs against hypertension: a datamining method based on association rules and a novel discrete algorithm. BMC Bioinf. 21, 1–21 (2020).
Faramarzi, A., Heidarinejad, M., Stephens, B. & Mirjalili, S. Equilibrium optimizer: A novel optimization algorithm. Knowl.-Based Syst. 191, 105190 (2020).
Katuwal, R., Suganthan, P. N. & Zhang, L. An ensemble of decision trees with random vector functional link networks for multi-class classification. Appl. Soft Comput. 70, 1146–1153 (2018).
Lopes, M. B. et al. Ensemble outlier detection and gene selection in triple-negative breast cancer data. BMC Bioinf. 19, 1–15 (2018).
Dimitriadis, S. I., Liparas, D., Tsolaki, M. N. & Alzheimer's Disease Neuroimaging Initiative. Random forest feature selection, fusion and ensemble strategy: combining multiple morphological MRI measures to discriminate among healthy elderly, MCI, cMCI and Alzheimer's disease patients: from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database. J. Neurosci. Methods 302, 14–23 (2018).
MotieGhader, H., Gharaghani, S., Masoudi-Sobhanzadeh, Y. & Masoudi-Nejad, A. Sequential and mixed genetic algorithm and learning automata (SGALA, MGALA) for feature selection in QSAR. IJPR 16, 533 (2017).
Khan, M. A. et al. An optimized method for segmentation and classification of apple diseases based on strong correlation and genetic algorithm based feature selection. IEEE Access 7, 46261–46277 (2019).
Xue, X., Li, C., Cao, S., Sun, J. & Liu, L. Fault diagnosis of rolling element bearings with a two-step scheme based on permutation entropy and random forests. Entropy 21, 96 (2019).
Wang, M. & Barbu, A. Are screening methods useful in feature selection? An empirical study. PloS ONE 14, e0220842 (2019).
Corrales, D. C., Lasso, E., Ledezma, A. & Corrales, J. C. Feature selection for classification tasks: Expert knowledge or traditional methods?. J. Intell. Fuzzy Syst. 34, 2825–2835 (2018).
Urbanowicz, R. J., Meeker, M., La Cava, W., Olson, R. S. & Moore, J. H. Relief-based feature selection: Introduction and review. J. Biomed. Inform. 85, 189–203 (2018).
Brahim, A. B. & Limam, M. Ensemble feature selection for high dimensional data: a new method and a comparative study. Adv. Data Anal. Classif. 12, 937–952 (2018).
Wah, Y. B., Ibrahim, N., Hamid, H. A., Abdul-Rahman, S. & Fong, S. Feature selection methods: case of filter and wrapper approaches for maximising classification accuracy. Pertanika J. Sci. Technol. 26 (2018).
Jović, A., Brkić, K. & Bogunović, N. In 2015 38th International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO) 1200–1205 (IEEE).
Asuncion, A. & Newman, D. UCI Machine Learning Repository (Univ. of California, Irvine, CA, USA, 2007).
Haghjoo, N. & Masoudi-Nejad, A. Introducing a panel for early detection of lung adenocarcinoma by using data integration of genomics, epigenomics, transcriptomics and proteomics. Exp. Mol. Pathol. 112, 104360 (2020).
Bulaghi, Z. A., Navin, A. H., Hosseinzadeh, M. & Rezaee, A. World competitive contest-based artificial neural network: A new class-specific method for classification of clinical and biological datasets. Genomics (2020).
Frank, A. UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science. http://archive.ics.uci.edu/ml (2010).
Grisoni, F., Consonni, V. & Ballabio, D. Machine learning consensus to predict the binding to the androgen receptor within the CoMPARA project. J. Chem. Inf. Model. 59, 1839–1848 (2019).
Guyon, I., Gunn, S. R., Ben-Hur, A. & Dror, G. In NIPS 545–552.
Mahe, P. et al. Automatic identification of mixed bacterial species fingerprints in a MALDI-TOF mass-spectrum. Bioinformatics 30, 1280–1286 (2014).
Weinstein, J. N. et al. The cancer genome atlas pan-cancer analysis project. Nat. Genet. 45, 1113–1120 (2013).
Shi, Y. & Eberhart, R. C. In Proceedings of the 1999 Congress on Evolutionary Computation-CEC99 (Cat. No. 99TH8406) 1945–1950 (IEEE).
Azad-Farsani, E., Zare, M., Azizipanah-Abarghooee, R. & Askarian-Abyaneh, H. A new hybrid CPSO-TLBO optimization algorithm for distribution network reconfiguration. J. Intell. Fuzzy Syst. 26, 2175–2184 (2014).
Ghaemi, M. & Feizi-Derakhshi, M.-R. Forest optimization algorithm. Expert Syst. Appl. 41, 6676–6687 (2014).
Dong, H., Li, T., Ding, R. & Sun, J. A novel hybrid genetic algorithm with granular information for feature selection and optimization. Appl. Soft Comput. 65, 33–46 (2018).
Liu, X.-Y., Liang, Y., Wang, S., Yang, Z.-Y. & Ye, H.-S. A hybrid genetic algorithm with wrapper-embedded approaches for feature selection. IEEE Access 6, 22863–22874 (2018).
Chang, C.-C. & Lin, C.-J. LIBSVM: a library for support vector machines. ACM Trans. Intell. Syst. Technol. 2, 1–27 (2011).
Ge, R. et al. McTwo: a two-step feature selection algorithm based on maximal information coefficient. BMC Bioinf. 17, 1–14 (2016).
Xue, X., Yao, M. & Wu, Z. A novel ensemble-based wrapper method for feature selection using extreme learning machine and genetic algorithm. Knowl. Inf. Syst. 57, 389–412 (2018).
Nahato, K. B., Nehemiah, K. H. & Kannan, A. Hybrid approach using fuzzy sets and extreme learning machine for classifying clinical datasets. Inf. Med. Unlocked 2, 1–11 (2016).
Mafarja, M. M. & Mirjalili, S. Hybrid whale optimization algorithm with simulated annealing for feature selection. Neurocomputing 260, 302–312 (2017).
Mirjalili, S. et al. Salp Swarm Algorithm: a bio-inspired optimizer for engineering design problems. Adv. Eng. Softw. 114, 163–191 (2017).
Karaboğa, D. & Ökdem, S. A simple and global optimization algorithm for engineering problems: differential evolution algorithm. Turk. J. Electr. Eng. Comput. Sci. 12, 53–60 (2004).
Mundra, P. A. & Rajapakse, J. C. SVM-RFE with MRMR filter for gene selection. IEEE Trans. Nanobiosci. 9, 31–37 (2009).
Duan, K.-B., Rajapakse, J. C., Wang, H. & Azuaje, F. Multiple SVM-RFE for gene selection in cancer classification with expression data. IEEE Trans. Nanobiosci. 4, 228–234 (2005).
Acknowledgements
The authors would like to thank the Iran National Science Foundation (INSF) for its support.
Author information
Authors and Affiliations
Contributions
Y.M.-S.: conceptualization, implementation, formal analysis, investigation, writing, editing, and revising the manuscript. H.M.: validation, data analysis, editing the manuscript. Y.O.: conceptualization, results analysis, validation, writing, editing, and revising the manuscript. A.M.-N.: conceptualization, supervision, project administration, writing, editing, and revising the manuscript. All authors have read and approved the manuscript.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher's note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Masoudi-Sobhanzadeh, Y., Motieghader, H., Omidi, Y. et al. A machine learning method based on the genetic and world competitive contests algorithms for selecting genes or features in biological applications. Sci Rep 11, 3349 (2021). https://doi.org/10.1038/s41598-021-82796-y