Abstract
The widespread application of microarray technology has produced a vast quantity of publicly available gene expression datasets. However, analyzing gene expression data with biostatistics and machine learning approaches is challenging because of (1) high noise; (2) small sample size with high dimensionality; (3) batch effects; and (4) low reproducibility of significant biomarkers. These issues reveal the complexity of gene expression data and significantly obstruct the use of microarray technology in clinical applications. Integrative analysis offers an opportunity to address these issues and provides a more comprehensive understanding of biological systems, but current methods have several limitations. This work leverages state-of-the-art machine learning developments for the integration and classification of multiple gene expression datasets and the identification of significant biomarkers. We design a novel integrative framework, MVIAm (Multi-View based Integrative Analysis of microarray data for identifying biomarkers). It applies multiple cross-platform normalization methods to aggregate multiple datasets into a multi-view dataset and uses a robust learning mechanism, Multi-View Self-Paced Learning (MVSPL), for gene selection in cancer classification problems. We demonstrate the capabilities of MVIAm using simulated data and studies of breast cancer and lung cancer. It can be applied flexibly and is an effective tool for addressing the four challenges of gene expression data analysis. Our proposed model makes microarray integrative analysis more systematic and expands its range of applications.
Introduction
Microarray technology is one of the most recent advances used in cancer research; it can measure the expression levels of thousands or tens of thousands of genes simultaneously. With the rapid development of microarray technology, many repositories of high-throughput gene expression data have been created and made available to researchers. The Gene Expression Omnibus (GEO), for example, currently stores more than 2.76 million samples from over 105,000 studies1. Using gene expression datasets to discover highly reliable biomarkers is an important goal in clinical applications. Significant biomarkers can help researchers detect disease in individuals, classify the type of disease, predict the response to therapy and so on2.
Analysis of gene expression data using biostatistics and machine learning approaches faces four major challenges. (1) High noise: the random noise and systematic biases present in gene expression data not only affect the scientific validity and cost of studies but also disrupt accurate prediction of phenotype, which may ultimately affect patients3,4. (2) Small sample size with high dimensionality: a gene expression dataset generally contains a large number of genes and a small number of samples, which is called the large p & small n problem5. Only a small fraction of genes are closely relevant to the target disease, and most genes are irrelevant6. From a machine learning perspective, numerous irrelevant genes may introduce noise and reduce the performance of the classifier7,8. (3) Batch effects: these occur because measurements are affected by many factors, including experimental principles, data collection standards and personnel differences. The systematic noise introduced when samples are processed in multiple batches has a detrimental effect on data derived from microarrays9,10. (4) Low reproducibility of significant biomarkers: the significant biomarkers that one group publishes after internal validation rarely overlap with those reported by other research groups11. These four issues reveal the complexity of gene expression data, which constrains the development of microarray technology in clinical applications.
To face these challenges and take advantage of the many published gene expression datasets, integrative analysis of gene expression data has become an effective tool: aggregating multiple datasets increases the statistical power to identify a small subset of genes that effectively predicts the type of disease12,13. Microarray integrative analysis, first proposed by Hamid et al.14, is broadly classified into “late stage” data integration and “early stage” data integration. However, current methods for microarray integrative analysis have several limitations. Most “late stage” data integration methods, such as GeneMeta18 and metaArray19, identify genes by combining univariate summary statistics, such as p-values15, effect sizes16 and rank aggregation12,17. As a result, it is difficult to identify non-redundant significant genes and to determine systematically (e.g. by cross-validation) how many genes to include in the subset. Moreover, such methods neglect correlations among genes and do not eliminate the batch effects between different datasets. Current “early stage” data integration methods usually apply one cross-platform normalization method to aggregate multiple datasets into a single unified large dataset. Classification and variable selection on the merged dataset can then be performed with machine learning methods. For example, Ma et al.20 proposed meta threshold gradient descent regularization (MTGDR) for gene selection in the integrative analysis of gene expression data. Li et al.21 published the Meta-lasso method, which not only boosts the statistical power to identify significant genes but also keeps the flexibility of gene selection. Recently, Hughey et al.22 developed an integrative analysis using logistic regression with an elastic net penalty (LEN), a powerful and versatile method for variable selection in classification.
In particular, cross-platform normalization is an essential part of “early stage” data integration, because it can eliminate the differences between datasets from different microarray platforms while preserving the underlying biological differences23. A number of cross-platform normalization methods have been developed that provide effective batch adjustment for microarray data, such as ComBat24, the cross-platform normalization (XPN) method25 and batch effects removal (ber)26. However, different cross-platform normalization methods are based on different statistical models and differ in accuracy, precision and overall effectiveness27. Current “early stage” data integration methods usually apply a single cross-platform normalization method, which cannot ensure maximum elimination of the batch effects. Beyond that, none of these integrative analysis methods has a robust learning mechanism to minimize the influence of noise. Therefore, there is a crucial need for a novel integrative analysis method for robust analysis of microarray data, prediction of cancer types and identification of significant biomarkers.
We design a novel integrative framework called MVIAm (Multi-View based Integrative Analysis of microarray data for identifying biomarkers). MVIAm can be divided into three phases: pre-processing each dataset, aggregating the datasets to generate multi-view data, and analyzing the multi-view data. MVIAm aggregates multiple microarray gene expression datasets through different cross-platform normalization methods and thereby generates multiple aggregated gene expression datasets. Each aggregated dataset has the same set of samples and features but is generated by a different statistical model, so together they form one type of multi-view data28. MVIAm thus extends traditional “early stage” data integration to multi-view data integration. Generally, multi-view data contains complementary information and is more comprehensive than single-view data29. In recent years, several multi-view machine learning methods for integrating multi-view data have been developed28,30. Supervised multi-view data integration methods generally include concatenation-based and ensemble-based integration31. MVIAm makes more multi-view machine learning methods applicable to supervised homogeneous data integration. The multi-view gene expression data generated by MVIAm has the following characteristics:
- Multi-view data generated by MVIAm can significantly increase the sample size, which greatly alleviates the large p & small n problem and increases the statistical power to identify biomarkers.
- Multi-view data typically contains complementary information and offers a more comprehensive understanding of the biological systems.
- The batch effects cannot be completely eliminated, meaning that each view of the data still has different types of bias.
Although quality control and different cross-platform normalization methods are used to process gene expression data, some noise and bias inevitably remain. In the data analysis phase, to alleviate the impact of noise on the learning process and take advantage of the significantly increased amount of data, we introduce a robust learning mechanism called self-paced learning32. Self-paced learning (SPL) is a typical sample reweighting method that is especially useful in high-noise situations33. It is based on the core idea of curriculum learning34. Curriculum learning (CL) is inspired by human learning: samples are gradually included in the training process from easy to complex. SPL embeds curriculum design as a regularization term in the learning objective and automatically selects samples for training, from easy to complex, in a purely self-paced way. Owing to its generality and generalization ability, SPL has been widely used in various tasks35,36,37,38. Moreover, Meng et al.39 have provided new theoretical understanding of the SPL scheme, which gives deeper insight into it. To analyze multi-view gene expression data, we propose Multi-View Self-Paced Learning (MVSPL), a robust supervised multi-view data integration method. The main idea of MVSPL is that the views interactively recommend high-confidence samples with small loss values to one another, and that samples are automatically selected from easy to complex to train the model for each view.
The main contributions of this work can be summarized as follows:
- We design a novel framework for gene expression data integration called MVIAm, which generates multi-view gene expression data based on different cross-platform normalization methods. Moreover, we propose a robust learning method, MVSPL, to analyze multi-view gene expression data for gene selection and cancer classification. It is an effective tool for addressing the challenges of microarray data analysis.
- Experimental results on both simulated and real data substantiate the superiority of MVSPL over a sparse logistic regression model with Lasso (L1), a sparse logistic regression model with elastic net (LEN), ensemble-based elastic net (Ensemble_EN) and SPL.
- Our proposed model makes gene expression integrative analysis more systematic and expands the range of applications that integrative analysis can address.
Methods
The MVIAm integrative framework
Figure 1 shows the pipeline of MVIAm, which aggregates multiple microarray datasets, identifies significant biomarkers and assesses the prediction performance of the model. MVIAm can be divided into three phases: pre-processing each dataset, aggregating the datasets to generate multi-view data, and analyzing the multi-view data.
Pre-processing each data set
The original Affymetrix data were first normalized and log-transformed by the robust multi-array average (RMA)40 method. We then downloaded and installed the appropriate custom chip definition file (CDF) packages according to the type of microarray platform. The CDF package is necessary for probe annotation of Affymetrix data. The probes of the normalized data were mapped to Entrez Gene IDs using annotation packages in Bioconductor41. If multiple probes matched a single Entrez ID, we took the median of those probes' values as the expression value of that gene.
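The last step, collapsing several probes onto one gene by their median, can be illustrated with a minimal sketch. The study itself works in R/Bioconductor; the Python function below, its name `collapse_probes`, the dictionary layout and the probe identifiers are all hypothetical and serve only to show the operation.

```python
from collections import defaultdict
from statistics import median


def collapse_probes(expression, probe_to_gene):
    """Collapse probe-level values to gene-level values by the per-sample median.

    expression:    {probe_id: [value for each sample]}   (hypothetical layout)
    probe_to_gene: {probe_id: entrez_id}; probes without a mapping are dropped
    """
    rows_by_gene = defaultdict(list)
    for probe, values in expression.items():
        gene = probe_to_gene.get(probe)
        if gene is not None:
            rows_by_gene[gene].append(values)
    # zip(*rows) walks the samples column by column across a gene's probes
    return {gene: [median(column) for column in zip(*rows)]
            for gene, rows in rows_by_gene.items()}


# Illustrative values only: three probes map to gene "672", one to "7157".
expression = {
    "probe_a": [5.0, 6.0],
    "probe_b": [7.0, 8.0],
    "probe_c": [9.0, 1.0],
    "probe_d": [4.0, 4.5],
    "probe_x": [2.0, 2.0],  # no Entrez mapping, so it is dropped
}
probe_to_gene = {"probe_a": "672", "probe_b": "672",
                 "probe_c": "672", "probe_d": "7157"}
gene_level = collapse_probes(expression, probe_to_gene)
```

The median is taken per sample, so a single aberrant probe (such as `probe_c` in the second sample) does not dominate the gene-level value.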
Aggregation and generation of multi-view data
One challenge of microarray integrative analysis is that the datasets may contain expression values for slightly different sets of genes. A common approach, which we follow, is to extract the genes shared by all gene expression datasets as the merged set of genes. MVIAm then applies different cross-platform normalization methods to the merged data to eliminate the batch effects. In this work, we use two cross-platform normalization methods, ComBat24 and ber26. ComBat is an empirical Bayes method with two variants that differ in the prior distributions assumed for the estimated parameters: a parametric prior method (ComBat_p) and a non-parametric method (ComBat_n). Ber removes batch effects using a two-stage regression approach and also has two variants: with bagging (ber_bg) and without bagging (ber).
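The aggregation phase can be sketched as follows: restrict every dataset to the shared genes, stack the samples while recording their batch of origin, and apply each normalization method to the same merged matrix to obtain one view per method. This is a minimal sketch under stated assumptions. The function names and data layout are hypothetical, and `center_per_batch` is a deliberately simple placeholder, not an implementation of ComBat or ber.

```python
def common_genes(datasets):
    """Genes present in every dataset, in sorted order."""
    shared = set.intersection(*(set(d["genes"]) for d in datasets))
    return sorted(shared)


def merge_on_common_genes(datasets):
    """Stack the samples of all datasets, keeping only the shared genes.

    Returns (genes, rows, batch); batch[i] is the dataset index of rows[i].
    """
    genes = common_genes(datasets)
    rows, batch = [], []
    for b, d in enumerate(datasets):
        column = {g: k for k, g in enumerate(d["genes"])}
        for sample in d["samples"]:
            rows.append([sample[column[g]] for g in genes])
            batch.append(b)
    return genes, rows, batch


def center_per_batch(rows, batch):
    """Placeholder normalizer (NOT ComBat or ber): subtract each batch's gene-wise mean."""
    out = [list(r) for r in rows]
    for b in set(batch):
        idx = [i for i, bb in enumerate(batch) if bb == b]
        for g in range(len(rows[0])):
            mean = sum(rows[i][g] for i in idx) / len(idx)
            for i in idx:
                out[i][g] = rows[i][g] - mean
    return out


def build_views(datasets, normalizers):
    """One view per normalizer; all views share the same samples and genes."""
    genes, rows, batch = merge_on_common_genes(datasets)
    return genes, {name: fn(rows, batch) for name, fn in normalizers.items()}


# Illustrative toy input: two datasets measuring overlapping gene sets.
d1 = {"genes": ["A", "B", "C"], "samples": [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]}
d2 = {"genes": ["B", "C", "D"], "samples": [[10.0, 20.0, 30.0]]}
genes, views = build_views(
    [d1, d2],
    {"unadjusted": lambda rows, batch: rows, "centered": center_per_batch},
)
```

In the framework described here, the entries of `normalizers` would be the actual batch-adjustment methods (ComBat_p, ComBat_n, ber, ber_bg), each producing one view.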
Multi-view self-paced learning (MVSPL)
Here we introduce the proposed multi-view self-paced learning (MVSPL) model in detail; it extends the self-paced learning35 model to multi-view scenarios. The fundamental concepts of SPL are described in the Related work section. Suppose we are given a dataset with multiple views \(D=\{({X}_{1}^{(j)},{y}_{1}),({X}_{2}^{(j)},{y}_{2}),\ldots ,({X}_{n}^{(j)},{y}_{n})\}\), where \({X}_{i}^{(j)}=({x}_{i1}^{(j)},{x}_{i2}^{(j)},\ldots ,{x}_{ip}^{(j)})\) is the i-th input sample with p features under the j-th view and \({y}_{i}\) is the label of the i-th sample, which takes the value 0 or 1 and is the same for every view in the classification model. Let \(L({y}_{i},f({x}_{i}^{(j)},{\beta }^{(j)}))\) denote the loss function, which calculates the loss between the real label \({y}_{i}\) and the estimated value \(f({x}_{i}^{(j)},{\beta }^{(j)})\) in the j-th view. \({\beta }^{(j)}\) represents the model parameter inside the decision function \(f({x}_{i}^{(j)},{\beta }^{(j)})\). The objective function of MVSPL can be expressed as:
where m denotes the total number of views, \({x}_{i}^{(j)}\) is the i-th input sample (i = 1, 2, …, n) under the j-th view, and \({y}_{i}\) is the corresponding label of \({x}_{i}^{(j)}\) for every j. \({v}_{i}^{(j)}\) denotes the weight of \({x}_{i}^{(j)}\). \({\lambda }^{(j)}\) is a tuning parameter that controls the complexity of the model in the j-th view. \({\gamma }^{(j)}\) is the age parameter, which controls the learning pace in each iteration in the j-th view. δ is the parameter that controls the influence of the other views when one view is about to select more training samples.
MVSPL corresponds to the sum of the SPL models under the individual views plus a regularization term \({\sum }_{\begin{array}{c}1\le k,j\le m\\ k\ne j\end{array}}{({v}^{(k)})}^{T}{v}^{(j)}\). This inner product encodes the relationship between the views. The regularizer reflects the basic assumption that multi-view data usually contains complementary information and is more comprehensive than single-view data. It therefore encourages the weight vector that penalizes the loss in one view to be similar to those of the other views.
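The structure just described, a sum of per-view SPL objectives plus an inner-product term coupling the views, can be written out as follows. This is a sketch only: it assumes the hard-threshold self-paced regularizer and an L1 penalty on each \({\beta }^{(j)}\), and the constraint set as well as the sign and scaling of the δ term are assumptions chosen to agree with the parameter definitions above.

```latex
\min_{\beta^{(j)},\; v^{(j)} \in [0,1]^{n}} \;
\sum_{j=1}^{m} \left[
  \sum_{i=1}^{n} v_i^{(j)} \, L\big(y_i, f(x_i^{(j)}, \beta^{(j)})\big)
  + \lambda^{(j)} \lVert \beta^{(j)} \rVert_1
  - \gamma^{(j)} \sum_{i=1}^{n} v_i^{(j)}
\right]
- \delta \sum_{\substack{1 \le k, j \le m \\ k \ne j}} \big(v^{(k)}\big)^{T} v^{(j)}
```

Under this form, with the model parameters and the other views' weights held fixed, the objective is linear in each \({v}_{i}^{(j)}\). Its minimizer therefore sets \({v}_{i}^{(j)}=1\) exactly when the sample's loss falls below a threshold that starts at \({\gamma }^{(j)}\) and grows, in proportion to δ, with the number of other views that have already selected sample i.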
The alternative optimization strategy
The alternative optimization strategy (AOS) can be used to solve the MVSPL model. The optimization process is as follows:
Initialization
\({v}^{(1)},{v}^{(2)},\ldots ,{v}^{(m)}\) are initialized as zero vectors in \({R}^{n}\), with one weight per sample. \({\gamma }^{(1)},{\gamma }^{(2)},\ldots ,{\gamma }^{(m)}\) are initialized with small values so that only a few samples enter training in the first iteration. δ is fixed at a specified value throughout the learning process. Multiple classifiers are trained simultaneously on all samples in the different views to obtain an initial loss for every sample in each view.
Update \({v}_{i}^{(k)}\) (k = 1, 2, …, m; k ≠ j)
The purpose of this step is to prepare confident samples with non-zero \({v}_{i}^{(k)}\) values for training on the j-th view. Taking the derivative of Eq. (1) with respect to \({v}_{i}^{(k)}\), we obtain:
According to Eq. (2), we can obtain the optimal weight for the i-th sample in the k-th view:
Update \({v}_{i}^{(j)}\)
This step determines which samples will be selected for training in the j-th view. The optimization of \({v}_{i}^{(j)}\) is the same as in the previous step and is expressed as:
The difference is that the samples selected in this step are used directly for training in the j-th view. Furthermore, samples that have been selected by other views are more likely than the rest to be selected for training.
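The two weight-update steps can be illustrated with a small sketch. It assumes a hard-threshold rule in which sample i is selected in a view when its loss is below \({\gamma }^{(j)}\) plus δ times the number of other views that currently select it; this threshold form, the function name `mvspl_weights` and the toy numbers are assumptions for illustration, not a transcription of the update equations.

```python
def mvspl_weights(losses_by_view, weights_by_view, gammas, delta, view):
    """Hard-threshold weight update for one view.

    A sample is selected (weight 1) when its loss in `view` is below
    gammas[view] + delta * (number of other views that currently select it).
    The threshold form is an assumption made for this sketch.
    """
    n_samples = len(losses_by_view[view])
    updated = []
    for i in range(n_samples):
        support = sum(w[i] for k, w in enumerate(weights_by_view) if k != view)
        threshold = gammas[view] + delta * support
        updated.append(1 if losses_by_view[view][i] < threshold else 0)
    return updated


# Toy example with two views and three samples (illustrative numbers only).
losses = [[0.2, 0.6, 0.5],   # per-sample losses in view 0
          [0.1, 0.3, 1.2]]   # per-sample losses in view 1
weights = [[0, 0, 0], [0, 0, 0]]
gammas, delta = [0.4, 0.4], 0.3

# Step 1: update the other view (k != j) to prepare confident samples.
weights[1] = mvspl_weights(losses, weights, gammas, delta, view=1)
# Step 2: update the view about to be trained (j = 0).
weights[0] = mvspl_weights(losses, weights, gammas, delta, view=0)
# For comparison: the same update for view 0 without any cross-view influence.
isolated = mvspl_weights(losses, [[0, 0, 0], weights[1]], gammas, 0.0, view=0)
```

In this toy case the second sample has a loss of 0.6 in view 0, which is above the age parameter 0.4, yet it is still selected because view 1 recommends it and raises the threshold to 0.7. With δ = 0 the same sample is left out.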
Update \({\beta }^{(j)}\)
The purpose of this step is to obtain the optimal solution for the j-th view. Here we choose the logistic regression classifier to train the model, so that Eq. (1) reduces to a penalized logistic regression optimization problem:
This problem can be readily solved with the R package glmnet42.
The age parameter \({\gamma }^{(j)}\) (j = 1, 2, …, m) is then increased to allow more samples with larger loss values into training in the next iteration. When \({\gamma }^{(j)}\) is small, only easy samples with small losses are selected under the j-th view. As \({\gamma }^{(j)}\) grows, more samples with larger losses are gradually selected under the j-th view to train a more “mature” model. We repeat the above optimization process for each variable under the different views until the maximum number of iterations is reached.
The pipeline of the proposed MVSPL is shown in Supplementary Fig. S1, and the whole alternative optimization strategy for solving MVSPL is summarized in Algorithm 1.
According to Algorithm 1, the MVSPL model can obtain the optimal solution for each view. Algorithm 1 jointly learns the model parameters \({\beta }^{(j)}\) and the latent weight variables \({v}^{(j)}\), where j = 1, …, m. Steps 7–11 compute the latent weight variables of all n samples in the m views with a time complexity of \(O(n\times {m}^{2})\). With the latent weight variables fixed, Step 12 computes the optimal solution of the generalized linear model with lasso penalty using the coordinate descent algorithm42, with a time complexity of \(O({n}^{2}\times p)\), where p is the number of features and n ≪ p. Because this step is carried out in every view, its total time complexity is \(O({n}^{2}\times p\times m)\). Since m ≪ n, the overall time complexity of Algorithm 1 is \(O({n}^{2}\times p\times m)\).
In the test phase, we are given a test dataset D′ = {X1, X2, …, Xu} with multiple views (1, 2, …, m), where u is the number of test samples. We first fix β(1), β(2), …, β(m) and then predict the optimal yk by solving the following minimization problem:
Related work
Self-paced learning (SPL)
The self-paced learning model combines a weighted loss term over all samples with a general self-paced regularizer imposed on the sample weights. Suppose we are given a dataset D = {(X1, y1), (X2, y2), …, (Xn, yn)}, where Xi = (xi1, xi2, …, xip) is the i-th input sample with p features and yi is the class of the i-th sample (e.g. yi ∈ {0, 1}). Let L(yi, f(xi, β)) denote the loss function, which calculates the loss between the real label yi and the estimated value f(xi, β). β represents the model parameter inside the decision function f(xi, β). The goal of SPL is to jointly learn the model parameter β and the latent weight variable v = [v1, v2, …, vn] by minimizing:
where γ is the age parameter that controls the learning pace and λ is a tuning parameter. The alternative optimization strategy can effectively solve the SPL problem. When β is fixed, the optimal weight variable \({v}^{\ast }=[{v}_{1}^{\ast },{v}_{2}^{\ast },\ldots ,{v}_{n}^{\ast }]\) can be calculated by:
Jointly updating the model parameter β and the latent weight variable v proceeds as follows: (1) When updating v with β fixed, a sample whose loss is smaller than the age parameter γ is treated as an easy sample with \({v}_{i}^{\ast }=1\); otherwise \({v}_{i}^{\ast }=0\). (2) When updating β with v fixed, the classifier is trained on the selected samples (\({v}_{i}^{\ast }=1\)). (3) Before the next iteration, the age parameter γ is increased to adjust the learning pace. When γ is small, only easy samples with small loss values are selected. As γ increases, more samples with larger losses are gradually selected to train a more “mature” model.
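Written out, the SPL scheme described in this section takes the following shape. This is a sketch assuming the hard-threshold self-paced regularizer and an L1 penalty on β, the setting most consistent with the sparse logistic regression used in this work; other self-paced regularizers exist.

```latex
\min_{\beta,\; v \in [0,1]^{n}} \; E(\beta, v; \lambda, \gamma)
= \sum_{i=1}^{n} v_i \, L\big(y_i, f(x_i, \beta)\big)
+ \lambda \lVert \beta \rVert_1
- \gamma \sum_{i=1}^{n} v_i ,
\qquad
v_i^{\ast} =
\begin{cases}
1, & L\big(y_i, f(x_i, \beta)\big) < \gamma, \\
0, & \text{otherwise.}
\end{cases}
```

With β fixed, the objective is linear in each \({v}_{i}\) with coefficient \(L({y}_{i},f({x}_{i},\beta ))-\gamma \), which is why the minimizer switches from 0 to 1 exactly when the loss drops below γ.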
By jointly learning the model parameter β and the latent weight variable v with an iterative algorithm in which the age parameter gradually increases, more and more samples are automatically selected for training, from easy to complex, in a self-paced way.
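The alternating scheme can be made concrete with a deliberately tiny example. Instead of a penalized logistic regression, the sketch below uses a toy model whose only parameter is a mean and whose loss is the squared error, so that the three steps (select easy samples, refit on them, raise γ) are visible in a few lines. The function names, the starting value, the growth factor and the data are all assumptions made for illustration.

```python
def spl_weights(losses, gamma):
    """Hard-threshold SPL weights: 1 for easy samples (loss < gamma), else 0."""
    return [1 if loss < gamma else 0 for loss in losses]


def self_paced_mean(values, gamma=1.0, growth=1.5, n_iter=5):
    """Toy SPL loop: beta is a single number and the loss is squared error."""
    beta = sorted(values)[len(values) // 2]   # crude initial fit on all samples
    history = []
    for _ in range(n_iter):
        losses = [(x - beta) ** 2 for x in values]
        v = spl_weights(losses, gamma)                 # (1) update v with beta fixed
        selected = [x for x, w in zip(values, v) if w]
        if selected:
            beta = sum(selected) / len(selected)       # (2) update beta with v fixed
        history.append((gamma, sum(v), beta))
        gamma *= growth                                # (3) increase the age parameter
    return beta, history


# Five consistent measurements and one gross outlier (illustrative data).
estimate, history = self_paced_mean([1.0, 1.2, 0.8, 1.1, 0.9, 10.0])
```

During these five iterations the outlier's loss stays far above γ, so it never enters the fit and the estimate stays at the mean of the five consistent values. If γ were allowed to keep growing, the outlier would eventually be admitted, which is why the number of iterations and the pace of γ matter in practice.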
Results
We demonstrate the performance of the proposed MVSPL in simulation and real microarray experiments. MVSPL is compared with four methods: sparse logistic regression with the Lasso penalty (L1)43, sparse logistic regression with the elastic net penalty (LEN)44, ensemble-based elastic net (Ensemble_EN)45 and SPL32. When MVIAm generates single-view data, it reduces to traditional “early stage” data integration, and the data can be analyzed by L1, LEN and SPL. Ensemble_EN builds a prediction model on each view of the data, then combines the model predictions and obtains the final prediction based on Eq. (6).
Analysis of simulated data
We generate three independent simulated datasets for integration, each characterized by a small sample size and high dimensionality. We use the normal distribution to generate X = (X1, X2, …, Xn) with n samples of p features each; the i-th sample is Xi = (xi1, xi2, …, xip). The correlation parameter ρ is then introduced into the simulated data46.
where the \({z}_{ij}\) are i.i.d. N(0, 1). The simulated dataset is generated from the logistic regression model, which is given by:
where ε = (ε1, ε2, …, εn)T is the vector of independent random errors drawn from N(0, 1) and σ is the noise control parameter.
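A data generator of this kind can be sketched as follows. The construction is an assumption made for illustration: features share a common Gaussian factor so that any two features have correlation ρ, the linear predictor is perturbed by σ times a standard normal error, and the label is obtained by thresholding the logistic probability at 0.5 (equivalently, the sign of the noisy linear predictor). The function name, the sparse coefficient vector and the reduced dimension are hypothetical.

```python
import math
import random


def simulate_dataset(n, p, beta, rho=0.0, sigma=0.0, seed=0):
    """Simulate n samples with p correlated features and binary labels.

    Assumed construction (for illustration only):
      x_ij = sqrt(1 - rho) * z_ij + sqrt(rho) * z_i0,  z ~ i.i.d. N(0, 1)
      eta_i = sum_j beta_j * x_ij + sigma * eps_i,     eps_i ~ N(0, 1)
      y_i = 1 if eta_i > 0 (logistic probability above 0.5), else 0
    """
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        shared = rng.gauss(0.0, 1.0)   # factor common to all features of a sample
        x = [math.sqrt(1.0 - rho) * rng.gauss(0.0, 1.0) + math.sqrt(rho) * shared
             for _ in range(p)]
        eta = sum(b * v for b, v in zip(beta, x)) + sigma * rng.gauss(0.0, 1.0)
        X.append(x)
        y.append(1 if eta > 0 else 0)
    return X, y


# Hypothetical sparse coefficients: three informative features out of twenty.
beta = [1.5, -1.0, 2.0] + [0.0] * 17
X, y = simulate_dataset(n=50, p=20, beta=beta, rho=0.2, sigma=0.4, seed=1)
```

Setting `rho` and `sigma` per dataset, and `n` per dataset, reproduces the kind of variation explored in the scenarios below, here at a much smaller dimension than p = 2000.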
We generated simulated data by the above procedure. Three independent simulated datasets were generated with the same number of variables (p = 2000). The coefficient β is set as follows:
Four scenarios were designed for the simulated experiment:
Scenario 1: The sample sizes are \({n}_{\mathrm{dataset1}}=100\), \({n}_{\mathrm{dataset2}}=100\) and \({n}_{\mathrm{dataset3}}=100\); the correlation coefficient is ρ = 0, 0.2, 0.4, 0.6 and 0.8; the noise control parameter is σ = 0.
Scenario 2: The sample sizes are \({n}_{\mathrm{dataset1}}=100\), \({n}_{\mathrm{dataset2}}=100\) and \({n}_{\mathrm{dataset3}}=100\); the noise control parameter is σ = 0, 0.2, 0.4, 0.6 and 0.8; the correlation coefficient is ρ = 0.
Scenario 3: The sample sizes are \({n}_{\mathrm{dataset1}}=50\), \({n}_{\mathrm{dataset2}}=100\) and \({n}_{\mathrm{dataset3}}=150\); the noise control parameter is σ = 0, 0.4 and 0.8; the correlation coefficient is ρ = 0.
Scenario 4: The sample sizes are \({n}_{\mathrm{dataset1}}=100\), \({n}_{\mathrm{dataset2}}=100\) and \({n}_{\mathrm{dataset3}}=100\); the noise control parameters are \({\sigma }_{\mathrm{dataset1}}=0.1\), \({\sigma }_{\mathrm{dataset2}}=0.2\) and \({\sigma }_{\mathrm{dataset3}}=0.3\); the correlation coefficient is ρ = 0.2.
Three independent simulated datasets are processed by MVIAm and aggregated into a large multi-view dataset. We use four functions, ComBat_p, ComBat_n, ber and ber_bg, to eliminate batch effects and generate view1, view2, view3 and view4 of the aggregated multi-view data, respectively. L1, LEN and SPL achieve their best performance on the view in which ComBat_p is used to eliminate the batch effects. Therefore, these three competing methods use view1 of the aggregated dataset in all four scenarios. The proposed MVSPL and Ensemble_EN have the flexibility to analyze data in multiple views. In Scenarios 1, 2 and 3, MVSPL and Ensemble_EN analyze two views of the data: view1 and view2. In Scenario 4, we further explore the scalability of our proposed method by running MVSPL with two, three and four interacting views, respectively. In each simulated experiment, we first combine the independent simulated datasets into a large aggregated dataset. The aggregated dataset is then divided into two groups by random sampling, with 70% of the samples used for training and the remaining samples for testing. The optimal regularization parameter λ is estimated on the training dataset by 10-fold cross-validation. We repeat this procedure 30 times and report the average measurements.
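The evaluation protocol (a random 70/30 partition, repeated 30 times, with the results averaged) can be sketched as below. The callback `fit_and_score` is hypothetical: it stands for whatever trains a classifier on the training indices, including the 10-fold cross-validation for λ, and returns a test metric. The dummy callback in the example only reports the test fraction so that the sketch runs on its own.

```python
import random


def random_partition(n_samples, train_fraction=0.7, rng=None):
    """Shuffle sample indices and split them into train and test index lists."""
    rng = rng or random.Random()
    indices = list(range(n_samples))
    rng.shuffle(indices)
    cut = round(train_fraction * n_samples)
    return indices[:cut], indices[cut:]


def repeated_evaluation(n_samples, fit_and_score, repeats=30, seed=0):
    """Average a test metric over repeated random 70/30 partitions.

    fit_and_score(train_idx, test_idx) is a hypothetical callback that trains
    on the training samples (tuning lambda by cross-validation on them only)
    and returns one number measured on the test samples.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(repeats):
        train_idx, test_idx = random_partition(n_samples, 0.7, rng)
        scores.append(fit_and_score(train_idx, test_idx))
    return sum(scores) / len(scores), scores


# Dummy callback for demonstration: reports the share of samples held out.
n = 300
mean_score, scores = repeated_evaluation(n, lambda train, test: len(test) / n)
train_idx, test_idx = random_partition(n, 0.7, random.Random(42))
```

Keeping the tuning of λ inside the callback, on the training indices only, ensures that the held-out 30% never influences model selection.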
To evaluate the prediction performance of the classifiers, accuracy, sensitivity, specificity and AUC are used in both the simulation and the real experiments. Definitions of these evaluation indicators can be found elsewhere47,48. In addition, the evaluation indicators for variable selection are defined as follows49:
where |·|0 represents the number of non-zero elements in a vector, \(\bar{\beta }\) and \(\overline{\hat{\beta }}\) are the logical “not” of β and \(\hat{\beta }\), respectively, and .* is the element-wise product.
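These quantities can be computed directly from the supports of the true and estimated coefficient vectors. The sketch below assumes the usual confusion-matrix form, in which β-sensitivity is the fraction of truly non-zero coefficients that are selected and β-specificity is the fraction of truly zero coefficients that are left out; the function name and the example vectors are hypothetical.

```python
def selection_metrics(beta_true, beta_hat):
    """Variable-selection sensitivity and specificity from coefficient supports.

    Counts follow the element-wise products of the support indicators:
      TP = |beta .* beta_hat|_0            FN = |beta .* not(beta_hat)|_0
      TN = |not(beta) .* not(beta_hat)|_0  FP = |not(beta) .* beta_hat|_0
    Assumes beta_true has at least one zero and one non-zero entry.
    """
    pairs = [(b != 0, h != 0) for b, h in zip(beta_true, beta_hat)]
    tp = sum(1 for b, h in pairs if b and h)
    fn = sum(1 for b, h in pairs if b and not h)
    tn = sum(1 for b, h in pairs if not b and not h)
    fp = sum(1 for b, h in pairs if not b and h)
    return tp / (tp + fn), tn / (tn + fp)


# Illustrative vectors: three true signals, two recovered, one false selection.
beta_true = [1.5, -2.0, 0.0, 0.0, 0.7, 0.0]
beta_hat = [0.9, 0.0, 0.0, 0.1, 0.3, 0.0]
beta_sensitivity, beta_specificity = selection_metrics(beta_true, beta_hat)
```

Only the positions of the non-zero entries matter here; the magnitudes of the estimated coefficients do not enter either indicator.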
In Scenario 1, we explored the effect of different correlation coefficients on the performance of the five methods. As shown in Fig. 2, on the training dataset the differences in prediction performance among the methods are quite small. On the test dataset, the prediction accuracy of all five methods decreases as the correlation parameter ρ increases, except for MVSPL at ρ = 0.8. The generalization ability of MVSPL and SPL is clearly superior to that of L1, LEN and Ensemble_EN. The average test accuracy, sensitivity and AUC obtained by MVSPL are higher than those of the competing methods across the values of ρ. The results obtained by SPL are slightly inferior to those of MVSPL but better than those of the other three methods in most situations. Moreover, Ensemble_EN outperforms L1 and LEN across the values of ρ.
In Scenario 2, we explored the effect of different noise control parameters on the performance of the five methods. As shown in Fig. 3, consistent with the results of Scenario 1, all methods have similar prediction performance on the training dataset. On the test dataset, the prediction accuracy of all the competing methods decreases as the noise control parameter increases. MVSPL and SPL demonstrate excellent generalization performance. The average test accuracy and AUC obtained by MVSPL are superior to those of the competing methods across the values of σ. For instance, with noise parameter σ = 0.4, the average test accuracy of MVSPL is 87.84%, compared with 85.04%, 84.96%, 87.11% and 85.44% obtained by L1, LEN, SPL and Ensemble_EN, respectively. In addition, the average test prediction performance of Ensemble_EN is better than that of the single-view based methods L1 and LEN in all cases of Scenario 2.
Table 1 shows the variable selection performance of the five methods in Scenarios 1 and 2, evaluated by β-sensitivity and β-specificity. Our method achieves the best β-sensitivity performance across all cases of simulated experiments. For instance, with noise parameter σ = 0.6, the average β-sensitivity of MVSPL is 91.73%, compared with 91.12%, 91.94%, 90.23% and 91.67% obtained by L1, LEN, SPL and Ensemble_EN, respectively. Moreover, analyzing more views of the data can improve β-sensitivity and help identify the significant variables. The average β-sensitivity of MVSPL and Ensemble_EN is superior to that of the single-view analysis methods in most cases. For example, with noise parameter σ = 0.8, the average β-sensitivity of MVSPL and Ensemble_EN is 91.09% and 90.34%, respectively, compared with 88.18%, 88.91% and 88.48% obtained by L1, LEN and SPL. The β-specificity of all the methods is relatively close across the parameter settings, between 97% and 99%.
In Scenario 3, we explored the effect of different sample sizes on the performance of the five methods. As shown in Fig. 4, MVSPL achieves the best test accuracy. MVSPL and SPL exhibit better generalization than the other methods, especially in the high-noise case σ = 0.8. Furthermore, the test accuracy of the multi-view based method Ensemble_EN is superior to that of the single-view based methods L1 and LEN in Scenario 3.
To further evaluate the proposed MVSPL method, we designed Scenario 4 in the simulated experiment. The prediction performance of MVSPL with different numbers of views is shown in Supplementary Fig. S2. As the number of views increases, the accuracy, sensitivity, specificity and AUC obtained by MVSPL on the test dataset improve. We also compare the prediction performance of MVSPL using three views with that of each of its individual views. Supplementary Fig. S3 shows that the prediction performance on each single view is worse than that of MVSPL using all views.
To sum up, the results of the simulated experiments lead to the following conclusions:
- MVSPL achieves better generalization ability than the competing methods. It outperforms them across the correlation parameters and noise parameters considered.
- Analyzing more views of the data makes it possible to improve prediction and variable selection performance. The average performance of MVSPL and Ensemble_EN is superior to that of the corresponding single-view based methods in most cases.
- When the number of views increases, the prediction performance of MVSPL improves. This implies that batch effects influence the data analysis and that more views contain more comprehensive information.
Real microarray datasets
We curated data from eight publicly available microarray studies: four breast cancer datasets (same platform) and four lung cancer datasets (disparate platforms) (Tables 2 and 3). All four breast cancer datasets were produced on the same microarray platform, HG-U133A. Classification of the breast cancer samples aims to distinguish each sample’s estrogen receptor (ER) status (+ve or −ve). The four publicly available lung cancer microarray datasets come from disparate platforms. All of these cancer gene expression datasets can be downloaded from GEO (https://www.ncbi.nlm.nih.gov/geo/).
Analysis of real data
For the real microarray data, two types of experimental design are used in this work. The first evaluates performance using a random partition. The second validates prediction performance on independent datasets. All publicly available cancer datasets are processed and aggregated in the manner described above (Supplementary Tables S1 and S2), and all of them include class information. Note that L1, LEN and SPL achieve their best performance on the view in which ComBat_p is used to eliminate the batch effects. Therefore, these three methods use this view of the aggregated dataset in the real data analysis. MVSPL and Ensemble_EN analyze two views of the data in the real data experiments, generated by using ComBat_p and ComBat_n to eliminate the batch effects.
Evaluating the performance using a random partition
To evaluate performance using a random partition, we randomly divide the samples so that 70% become training samples and the rest become test samples. The optimal regularization parameter λ is estimated on the training dataset by 10-fold cross-validation. We repeat this procedure 30 times and report the average measurement and standard error.
Figures 5 and 6 show box plots of the training and test prediction performance on the breast and lung cancer datasets, respectively, over 30 repetitions. As shown in Fig. 5, on the training dataset all five methods achieve desirable performance; for instance, the median training accuracy of every method exceeds 94%. On the test dataset, the proposed MVSPL performs better than the competing methods. For example, the median test accuracy of MVSPL is 84.21%, compared with 75.44%, 77.19%, 80.70% and 76.90% obtained by L1, LEN, SPL and Ensemble_EN, respectively. Our method thus shows better generalization ability than the competing methods. For the lung cancer dataset, as shown in Fig. 6, the training and test prediction performance of all five methods exceeds 90%. Our proposed MVSPL method still obtains better classification accuracy, sensitivity, specificity and AUC than the other methods. The average number of selected genes for all methods is summarized in Supplementary Table S3.
Validating the classifier on independent dataset
For the validation of the classifier on an independent dataset, the design of the validation process is the same as that of metAnalyzeAll22. After each dataset is pre-processed individually, all the training datasets and the independent validation dataset are merged in the manner described above. The classifier is trained on the samples from the aggregated training dataset, and the optimal regularization parameter λ is obtained by 10-fold cross-validation. The classifier is then tested on the samples from the independent validation dataset.
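The train, tune and validate sequence described above can be outlined as follows. This is a sketch under stated assumptions rather than the authors' code: `fit(train_idx, lam)`, `score(model, val_idx)` and `validate(model)` are hypothetical callbacks standing in for the actual classifier, the cross-validated performance measure and the evaluation on the independent validation samples.

```python
import random


def kfold_indices(n_samples, k=10, seed=0):
    """Split shuffled sample indices into k disjoint folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def select_lambda(n_train, lambdas, fit, score, k=10, seed=0):
    """Pick the lambda with the highest mean k-fold cross-validation score."""
    folds = kfold_indices(n_train, k, seed)
    best_lam, best_score = None, float("-inf")
    for lam in lambdas:
        fold_scores = []
        for i, val_idx in enumerate(folds):
            # Train on every fold except the i-th, then score on the i-th.
            train_idx = [j for f, fold in enumerate(folds) if f != i
                         for j in fold]
            fold_scores.append(score(fit(train_idx, lam), val_idx))
        mean_score = sum(fold_scores) / k
        if mean_score > best_score:
            best_lam, best_score = lam, mean_score
    return best_lam


def train_and_validate(n_train, lambdas, fit, score, validate, k=10, seed=0):
    """Tune lambda on the training samples only, refit, then validate once."""
    lam = select_lambda(n_train, lambdas, fit, score, k, seed)
    model = fit(list(range(n_train)), lam)  # refit on all training samples
    return lam, validate(model)
```

The point of the design is that the independent validation samples are never seen while λ is being chosen; they are used only once, after the final refit.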
Figure 7 compares the validation prediction performance of L1, LEN, SPL, Ensemble_EN and MVSPL on the validation datasets of the breast cancer and lung cancer studies. On the validation datasets, MVSPL outperforms the competing methods on nearly all measures in the cancer classification problem. As shown on the left-hand side of Fig. 7, in the breast cancer study the validation accuracy, specificity and AUC of MVSPL are superior to those of the competing methods; the exception is sensitivity. Specifically, MVSPL achieves a gain of approximately 10% in validation accuracy over L1 and LEN. Ensemble_EN achieves the second-best performance. In the breast cancer study, the multi-view analysis methods therefore achieve better validation prediction performance than the single-view analysis methods. For the lung cancer study, as shown on the right-hand side of Fig. 7, the validation prediction performance of the proposed MVSPL method is markedly better than that of the other methods. For example, the validation sensitivity of MVSPL is 91.30%, which is superior to the 43.24%, 45.95%, 78.26% and 73.91% obtained by L1, LEN, SPL and Ensemble_EN, respectively. The validation prediction performance of SPL is inferior to that of MVSPL but clearly superior to that of L1, LEN and Ensemble_EN. Moreover, Ensemble_EN outperforms L1 and LEN on the validation data. In summary, by learning from easy to complex samples and by letting multiple views interact, MVSPL achieves better generalization ability than the competing methods. Overall, MVSPL can be successfully applied to microarray integrative analysis in cancer classification. The average number of selected genes for all methods is summarized in Supplementary Table S4.
For a brief biological analysis of the selected genes, we summarize the 20 top-ranked genes selected by the five integrative analysis methods in the two cancer studies in Tables 4 and 5, respectively. To illustrate the interplay between the top selected genes from the microarray integrative analysis, we constructed a network of interactions among the genes using cBioPortal50,51. Figure 8 shows the interaction network of the 20 top-ranked genes selected by MVSPL in the breast cancer study. The network shows that SNAPC5, PCBP2 and GNA13, which are also selected by other competing methods, are connected to other frequently altered genes from the TCGA breast invasive carcinoma dataset. Moreover, TNFSF11, which is selected only by MVSPL and SPL, is targeted by two FDA-approved cancer drugs. Among the genes selected only by MVSPL, UBE21 is connected to other frequently altered genes and RNASE2 is targeted by three cancer drugs. For the lung cancer study, Fig. 9 shows the interaction network of the 20 top-ranked genes obtained by the proposed MVSPL. The resulting network shows that TRPC3, DCC, MYH1, GH2 and KLHL21 are linked to other frequently altered genes from the TCGA lung adenocarcinoma dataset. MYH1 and GGT5 are targeted by certain cancer drugs. Moreover, MLNR, IGHE and RPL10L, which are obtained only by MVSPL, are targets of cancer drugs.
In addition, a number of genes selected by the five methods have been reported in the literature. For example, in breast cancer, downregulation of ALOX15 expression has been reported52,53. Upregulated expression of CDK14 promotes tumor cell proliferation, migration and invasion through the Wnt/β-catenin signaling pathway in breast cancer54. UPK3A, which is selected only by MVSPL and SPL, is highly expressed in breast cancer55. Beyond that, MVSPL selects some genes that are not selected by the other methods. Phuong et al.56 confirmed that MAT2A expression in TAM-resistant human breast cancer tissues was higher than that in TAM-responsive cases. Nass et al.57 proposed that NNAT expression determined by immunohistochemistry might become a helpful additional biomarker to identify high-risk breast cancer patients. For lung cancer, Greenman et al.58 reported in 2007 that the role of TTN as a cancer gene was at that time a mathematically based prediction that would require direct biological evaluation. Some years later, Tan et al.59 reported that TTN and/or MUC16 were retained in the top 10 for lung cancer, suggesting their tumorigenic relevance to these cancers. MASP1 is overexpressed in lung cancer60. In this part, we analyze the 20 top-ranked genes selected by the five methods in the two cancer studies at the gene level. According to the network of interactions among the genes, a small number of genes are connected to other frequently altered genes from the publicly available datasets, and some genes are targeted by certain cancer drugs.
Conclusion
Due to the complexity of gene expression data, four major issues constrain the development of microarray technology in clinical applications: high noise, the large p and small n problem, batch effects, and low reproducibility of significant biomarkers. In this work, we design a novel framework called MVIAm to tackle these issues. MVIAm utilizes different cross-platform normalization methods to minimize the impact of batch effects while keeping as much useful information as possible in the microarray gene expression data. In addition, the aggregated gene expression datasets generated by MVIAm are multi-view data. This implies that MVIAm can significantly alleviate the large p and small n problem compared with existing integrative analysis methods, and can therefore increase the statistical power for identifying significant biomarkers. To analyze multi-view gene expression data, we propose a robust learning mechanism called MVSPL that minimizes the interference of high noise. The MVSPL method can improve generalization performance by learning multi-view data in a meaningful order, and can improve prediction performance through the interaction between multiple views. MVSPL corresponds to the sum of the SPL models under multiple views plus a regularization term. This method implements robust learning regimes in multiple views under the regularization that the robust loss forms in the different views are closely related. According to the results of the simulation and real data experiments, MVSPL performs better than L1, LEN, SPL and Ensemble_EN. On the test and validation datasets in particular, MVSPL shows prominent generalization performance. In short, MVSPL is a feasible and effective method for variable selection and classification in high-dimensional data.
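To make the "easy to complex" idea concrete, the sketch below shows the standard hard-threshold sample weighting of self-paced learning35, in which a sample is used for training only if its current loss is below an age parameter. The multi-view coupling shown afterwards is an assumption made purely for illustration and is not taken from this paper: a sample's threshold in one view is raised by a hypothetical amount `gamma` for each other view that currently selects it, so that views tend to agree on which samples are easy. The names `lam`, `gamma` and `n_sweeps` are likewise illustrative.

```python
def spl_weights(losses, lam):
    """Hard self-paced weights: keep a sample only if its loss is below lam."""
    return [1 if loss < lam else 0 for loss in losses]


def multiview_spl_weights(view_losses, lam, gamma, n_sweeps=5):
    """Illustrative multi-view weighting (assumed form, not the paper's).

    view_losses[j][i] is the loss of sample i under view j. A sample that is
    already selected in other views gets a more lenient threshold here.
    """
    n_views = len(view_losses)
    weights = [spl_weights(losses, lam) for losses in view_losses]
    for _ in range(n_sweeps):
        for j in range(n_views):
            for i, loss in enumerate(view_losses[j]):
                # Number of other views that currently select sample i.
                support = sum(weights[k][i] for k in range(n_views) if k != j)
                weights[j][i] = 1 if loss < lam + gamma * support else 0
    return weights
```

In a full training loop, the weights would alternate with refitting the classifier in each view, and `lam` would be increased gradually so that harder samples are admitted over time.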
Some ongoing challenges and promising directions motivate future work. First, our proposed method conducts variable selection on aggregated microarray data in an “all-in-or-all-out” fashion; that is, a gene is either identified in all studies or not identified in any study. However, due to data heterogeneity, some genes may be important in some studies but unimportant in others. In the future, we will take this situation into account to improve our model. Second, rapid advances in technology have produced a vast quantity of large-scale molecular omics datasets, each of which provides a distinct view of the complex biological system. A multi-omics dataset contains the same set of samples described by several distinct feature sets, and is therefore naturally multi-view data. In the future, we will apply our method to the analysis of multi-omics data. We believe that the computational analysis of multi-omics data provides an unprecedented opportunity to deepen our understanding of complex cancer mechanisms. Our proposed method makes integrative analysis more systematic and expands its range of applications.
Data Availability
The code for this paper can be downloaded from https://github.com/must-bio-team/MVIAm.
References
Barrett, T. et al. NCBI GEO: archive for functional genomics data sets-update. Nucleic acids research 41, D991–D995 (2012).
Pepe, M. S. & Feng, Z. Improving biomarker identification with better designs and reporting. Clinical Chemistry 1093–1095 (2011).
Draghici, S. Statistical intelligence: effective analysis of high-density microarray data. Drug discovery today 7, S55–S63 (2002).
Kitchen, R. R. et al. Relative impact of key sources of systematic noise in affymetrix and illumina gene-expression microarray experiments. BMC genomics 12, 589 (2011).
Bolón-Canedo, V., Sánchez-Marono, N., Alonso-Betanzos, A., Benítez, J. M. & Herrera, F. A review of microarray datasets and applied feature selection methods. Inf. Sci 282, 111–135 (2014).
Wang, Y., Miller, D. & Clarke, R. Approaches to working in high-dimensional data spaces: gene expression microarrays. Br. journal cancer 98, 1023 (2008).
Liang, Y. et al. Sparse logistic regression with a L1/2 penalty for gene selection in cancer classification. BMC bioinformatics 14, 198 (2013).
Yang, Z. Y. et al. Robust sparse logistic regression with the L q(0 < q < 1) regularization for feature selection using gene expression data. IEEE Access 6, 68586–68595 (2018).
Larkin, J. E., Frank, B. C., Gavras, H., Sultana, R. & Quackenbush, J. Independence and reproducibility across microarray platforms. Nat. methods 2, 337 (2005).
Leek, J. T. et al. Tackling the widespread and critical impact of batch effects in high-throughput data. Nat. Rev. Genet. 11, 733 (2010).
Shen, R., Chinnaiyan, A. M. & Ghosh, D. Pathway analysis reveals functional convergence of gene expression profiles in breast cancer. BMC medical genomics 1, 28 (2008).
Tseng, G. C., Ghosh, D. & Feingold, E. Comprehensive literature review and statistical considerations for microarray meta-analysis. Nucleic acids research 40, 3785–3799 (2012).
Sørlie, T. et al. Repeated observation of breast tumor subtypes in independent gene expression data sets. Proc. national academy sciences 100, 8418–8423 (2003).
Hamid, J. S. et al. Data integration in genetics and genomics: methods and challenges. Hum. genomics proteomics: HGP 2009 (2009).
Rhodes, D. R. et al. Large-scale meta-analysis of cancer microarray data identifies common transcriptional profiles of neoplastic transformation and progression. Proc. Natl. Acad. Sci. 101, 9309–9314 (2004).
Choi, J. K., Yu, U., Kim, S. & Yoo, O. J. Combining multiple microarray studies and modeling interstudy variation. Bioinformatics 19, i84–i90 (2003).
Chang, L.-C., Lin, H.-M., Sibille, E. & Tseng, G. C. Meta-analysis methods for combining multiple expression profiles: comparisons, statistical characterization and an application guideline. BMC bioinformatics 14, 368 (2013).
Lusa, L., Gentleman, R. & Ruschhaupt, M. Genemeta: metaanalysis for high throughput experiments. R package version 1 (2006).
Parmigiani, G., Garrett, E. S., Anbazhagan, R. & Gabrielson, E. A statistical framework for expression-based molecular classification in cancer. J. Royal Stat. Soc. Ser. B (Statistical Methodol.) 64, 717–736 (2002).
Ma, S. & Huang, J. Regularized gene selection in cancer microarray meta-analysis. BMC bioinformatics 10, 1 (2009).
Li, Q., Wang, S., Huang, C.-C., Yu, M. & Shao, J. Meta-analysis based variable selection for gene expression data. Biometrics 70, 872–880 (2014).
Hughey, J. J. & Butte, A. J. Robust meta-analysis of gene expression using the elastic net. Nucleic acids research 43, e79–e79 (2015).
Walsh, C., Hu, P., Batt, J. & Santos, C. Microarray meta-analysis and cross-platform normalization: integrative genomics for robust biomarker discovery. Microarrays 4, 389–406 (2015).
Johnson, W. E., Li, C. & Rabinovic, A. Adjusting batch effects in microarray expression data using empirical bayes methods. Biostatistics 8, 118–127 (2007).
Shabalin, A. A., Tjelmeland, H., Fan, C., Perou, C. M. & Nobel, A. B. Merging two gene-expression studies via cross-platform normalization. Bioinformatics 24, 1154–1160 (2008).
Giordan, M. A two-stage procedure for the removal of batch effects in microarray studies. Stat. Biosci. 6, 73–84 (2014).
Chen, C. et al. Removing batch effects in analysis of expression microarray data: an evaluation of six batch adjustment methods. PloS one 6, e17238 (2011).
Li, Y., Wu, F.-X. & Ngom, A. A review on machine learning principles for multi-view biological data integration. Briefings bioinformatics 19, 325–340 (2016).
Li, Y., Yang, M. & Zhang, Z. M. A survey of multi-view representation learning. IEEE Transactions on Knowl. Data Eng. (2018).
Zhao, J., Xie, X., Xu, X. & Sun, S. Multi-view learning overview: Recent progress and new challenges. Inf. Fusion 38, 43–54 (2017).
Singh, A. et al. Diablo: an integrative approach for identifying key molecular drivers from multi-omics assays. Bioinformatics (2019).
Kumar, M. P., Packer, B. & Koller, D. Self-paced learning for latent variable models. In Advances in Neural Information Processing Systems, 1189–1197 (2010).
Shu, J. et al. Meta-Weight-Net: Learning an explicit mapping for sample weighting. arXiv preprint arXiv:1902.07379 (2019).
Bengio, Y., Louradour, J., Collobert, R. & Weston, J. Curriculum learning. In Proceedings of the 26th annual international conference on machine learning, 41–48 (ACM, 2009).
Kumar, M. P., Turki, H., Preston, D. & Koller, D. Learning specific-class segmentation from diverse data. In Computer Vision (ICCV), 2011 IEEE International Conference on, 1800–1807 (IEEE, 2011).
Tang, K., Ramanathan, V., Fei-Fei, L. & Koller, D. Shifting weights: Adapting object detectors from image to video. In Advances in Neural Information Processing Systems, 638–646 (2012).
Jiang, L., Meng, D., Mitamura, T. & Hauptmann, A. G. Easy samples first: Self-paced reranking for zero-example multimedia search. In Proceedings of the 22nd ACM international conference on Multimedia, 547–556 (ACM, 2014).
Chai, H., Li, Z.-N., Meng, D.-Y., Xia, L.-Y. & Liang, Y. A new semi-supervised learning model combined with cox and sp-aft models in cancer survival analysis. Sci. reports 7, 13053 (2017).
Meng, D., Zhao, Q. & Jiang, L. A theoretical understanding of self-paced learning. Inf. Sci. 414, 319–328 (2017).
Irizarry, R. A. et al. Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics 4, 249–264 (2003).
Gentleman, R. C. et al. Bioconductor: open software development for computational biology and bioinformatics. Genome biology 5, R80 (2004).
Friedman, J., Hastie, T. & Tibshirani, R. Regularization paths for generalized linear models via coordinate descent. J. statistical software 33, 1 (2010).
Tibshirani, R. Regression shrinkage and selection via the lasso. J. Royal Stat. Soc. Ser. B (Methodological) 267–288 (1996).
Zou, H. & Hastie, T. Regularization and variable selection via the elastic net. J. Royal Stat. Soc. Ser. B (Statistical Methodol.) 67, 301–320 (2005).
Günther, O. P. et al. A computational pipeline for the development of multi-marker bio-signature panels and ensemble classifiers. BMC bioinformatics 13, 326 (2012).
Sohn, I., Kim, J., Jung, S.-H. & Park, C. Gradient lasso for cox proportional hazards model. Bioinformatics 25, 1775–1781 (2009).
Baratloo, A., Hosseini, M., Negida, A. & El Ashal, G. Part 1: simple definition and calculation of accuracy, sensitivity and specificity. Emergency 3, 48–49 (2015).
Lobo, J. M., Jiménez-Valverde, A. & Real, R. Auc: a misleading measure of the performance of predictive distribution models. Glob. ecology Biogeogr. 17, 145–151 (2008).
Zhang, W. et al. Molecular pathway identification using biological network-regularized logistic models. BMC genomics 14, S7 (2013).
Gao, J. et al. Integrative analysis of complex cancer genomics and clinical profiles using the cbioportal. Sci. Signal. 6, pl1–pl1 (2013).
Cerami, E. et al. The cbio cancer genomics portal: an open platform for exploring multidimensional cancer genomics data (2012).
Jiang, W. G., Watkins, G., Douglas-Jones, A. & Mansel, R. E. Reduction of isoforms of 15-lipoxygenase (15-lox)-1 and 15-lox-2 in human breast cancer. Prostaglandins, Leukot. Essent. Fat. Acids 74, 235–245 (2006).
Ho, C. F.-Y. et al. Expression of dha-metabolizing enzyme alox15 is regulated by selective histone acetylation in neuroblastoma cells. Neurochem. research 43, 540–555 (2018).
Gu, X. et al. Upregulated pftk1 promotes tumor cell proliferation, migration, and invasion in breast cancer. Med. Oncol. 32, 195 (2015).
Cancer Genome Atlas Research Network. Comprehensive molecular characterization of urothelial bladder carcinoma. Nature 507, 315 (2014).
Phuong, N. T. T. et al. Induction of methionine adenosyltransferase 2a in tamoxifen-resistant breast cancer cells. Oncotarget 7, 13902 (2016).
Nass, N. et al. High neuronatin (nnat) expression is associated with poor outcome in breast cancer. Virchows Arch. 471, 23–30 (2017).
Greenman, C. et al. Patterns of somatic mutation in human cancer genomes. Nature 446, 153 (2007).
Tan, H., Bao, J. & Zhou, X. Genome-wide mutational spectra analysis reveals significant cancer-specific heterogeneity. Sci. reports 5, 12566 (2015).
Kang, J. U., Koo, S. H., Kwon, K. C., Park, J. W. & Kim, J. M. Identification of novel candidate target genes, including ephb3, masp1 and sst at 3q26.2-q29 in squamous cell carcinoma of the lung. BMC cancer 9, 237 (2009).
Acknowledgements
This work is partially supported by the Chinese Ministry of Education’s Tian Cheng Hui Zhi Innovation and Education Improvement Funds (Grant No. 2018A01014), the Macau Science and Technology Develop Funds (Grant No. 0055/2018/A2) of Macao SAR of China and China NSFC project under contract 61661166011.
Author information
Contributions
Z.Y.Y., J.S. and Y.L. proposed the novel MVIAm integrative framework and the multi-view self-paced learning approach, designed the algorithm, and wrote the code and the manuscript. X.Y.L., H.Z. and Y.Q.R. provided the real data and analyzed the biological information. Z.B.X. provided technical support. All authors reviewed the manuscript.
Ethics declarations
Competing Interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Yang, ZY., Liu, XY., Shu, J. et al. Multi-view based integrative analysis of gene expression data for identifying biomarkers. Sci Rep 9, 13504 (2019). https://doi.org/10.1038/s41598-019-49967-4
DOI: https://doi.org/10.1038/s41598-019-49967-4