Abstract
Meiotic recombination is unevenly distributed across the genome. Genomic regions that exhibit relatively high frequencies of recombination are called hotspots, whereas those with relatively low frequencies are called coldspots. Identifying hotspots and coldspots therefore provides useful information for studying the mechanism of recombination. In this study, we propose a computational predictor called iRSpot-DACC to predict hot/cold spots across the yeast genome. It combines Support Vector Machines (SVMs) with a feature called dinucleotide-based auto-cross covariance (DACC), which incorporates the global sequence-order information and fifteen local DNA properties into the predictor. Combining it with Principal Component Analysis (PCA) further improves its performance. Experimental results on a benchmark dataset show that iRSpot-DACC achieves an accuracy of 82.7%, outperforming some highly related methods.
Introduction
Meiotic recombination is the process by which alleles are exchanged between homologous chromosomes during meiosis1,2. It plays an important role in genome evolution3,4. Because recombination can produce diverse gametes, it provides material for natural selection. Moreover, recombination also influences genome evolution via gene conversion or mutagenesis5,6.
Although the mechanism of recombination is still unclear, it is well established that recombination plays an important part in promoting genome evolution. The distribution pattern of recombination positions has drawn much attention, and several studies have been performed on chromosomes7,8,9. Some studies have found that recombination is unevenly distributed across the genome. Genomic regions that exhibit relatively high frequencies of recombination are called hotspots, while those with relatively low frequencies of recombination are called coldspots10,11. With the rapid development of sequencing technology, the number of sequenced genomes is growing explosively. Therefore, it is necessary to develop stable methods for the identification of recombination spots.
Although a great deal of recombination information can be acquired from experiments, identifying recombination hot/cold spots from DNA sequence information alone is still a challenging task. Recently, several models have been proposed to predict recombination hotspots and coldspots. For example, Zhou et al.12 proposed an SVM-based method using codon composition to distinguish hotspots from coldspots. Later, Jiang et al.13 employed a Random Forest classifier trained with gapped dinucleotide composition features to distinguish hotspots from coldspots in Saccharomyces cerevisiae. Guo et al.14 proposed an SVM model based on DNA physical properties to predict hot/cold spots in yeast. Combining increment of diversity with quadratic discriminant analysis (IDQD), Liu et al.1 presented a model based on k-mer frequencies of the DNA sequences. Wu et al.15 proposed an SVM model based on genomic and epigenomic features to predict meiotic recombination hotspots in human and mouse. Chen et al.16 presented an SVM model based on pseudo dinucleotide composition. Wang et al.17 proposed a method based on gapped k-mers. Most of these predictors considered only the local sequence-order information, and little global sequence-order information was taken into account. However, the global sequence-order information has shown strong discriminative power in many bioinformatics tasks. Therefore, the global sequence-order information should be incorporated into a predictor. Unfortunately, this is not easy, because DNA sequences differ in length.
To address this problem, a feature called dinucleotide-based auto-cross covariance (DACC)18, which can incorporate the global sequence-order effects of DNA sequences into the predictor, is applied to recombination hot/cold spot identification. Combining DACC with Support Vector Machines (SVMs), a predictor called iRSpot-DACC is proposed. To further improve its predictive performance and reduce its computational cost, Principal Component Analysis (PCA)19 is then adopted. Experimental results on a benchmark dataset demonstrate that the proposed method outperforms some highly related models, including IDQD1 and iRSpot-PseDNC16.
Results
Influence of parameters on the predictive performance of iRSpot-DACC
iRSpot-DACC has one parameter, lag (the distance between two dinucleotides), that affects its predictive performance. In the current study, lag is optimized via 5-fold cross validation. The influence of lag on the performance of iRSpot-DACC is shown in Fig. 1, from which we can see that the best performance is achieved when lag = 6, although this parameter has only a small impact on the performance. DACC is the combination of Dinucleotide-based auto covariance (DAC) and Dinucleotide-based cross covariance (DCC) (cf. section Material and Methods). With this parameter setting, the feature vectors for DAC and DCC have lengths 15 × 6 = 90 and 15 × 14 × 6 = 1260, respectively. Therefore, the dimension of DACC is 90 + 1260 = 1350.
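The dimension arithmetic above can be checked with a short Python sketch. The function name `dacc_dimensions` is hypothetical and is introduced only for illustration; the counts follow the N × LAG and N × (N − 1) × LAG lengths stated for DAC and DCC.

```python
def dacc_dimensions(n_properties, max_lag):
    """Return (DAC length, DCC length, DACC length) for the given settings."""
    dac_len = n_properties * max_lag                        # one value per property and lag
    dcc_len = n_properties * (n_properties - 1) * max_lag   # ordered pairs of distinct properties
    return dac_len, dcc_len, dac_len + dcc_len


# Fifteen dinucleotide properties and a maximum lag of 6, as in this study.
print(dacc_dimensions(15, 6))  # (90, 1260, 1350)
```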
The computational performance of iRSpot-DACC can be further improved by using PCA
To further improve the predictive performance of iRSpot-DACC and reduce its computational cost, Principal Component Analysis (PCA)19 is employed.
PCA has a parameter w (cf. Eq. (18)), which affects both the predictive accuracy and the dimension of the feature vectors. Therefore, we optimize this parameter using 5-fold cross validation. The results show that iRSpot-DACC-PCA (iRSpot-DACC combined with PCA) achieves its best performance when w = 0.99. Its performance is shown in Table 1, from which we can see that iRSpot-DACC-PCA outperforms iRSpot-DACC.
The feature vectors of iRSpot-DACC-PCA have 173 dimensions, far fewer than the 1350 dimensions of the original iRSpot-DACC feature vectors. Therefore, PCA both improves the predictive accuracy of iRSpot-DACC and reduces its computational cost.
Discriminative visualization and interpretation
To further explore the discriminative power of the features and their biological meaning, we calculate the discriminative weight vector as described in a previous study20. The feature discriminative weight vector W can be formulated as:
$$W = A \cdot M \tag{1}$$
where A is the 1 × N vector of weights for the training samples obtained from the SVM training process; M is the N × j feature matrix of the benchmark dataset used in the current study; N is the number of DNA sequences in the training dataset; and j is the dimension of the feature vector. Therefore, W is a 1 × j vector, and each of its elements represents the discriminative power of the corresponding feature.
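As a minimal sketch of this calculation, the weight vector can be obtained as the product of the per-sample weights and the feature matrix. The function name `discriminative_weights` and the toy numbers below are hypothetical; in practice A would come from the trained SVM and M from the DACC features.

```python
def discriminative_weights(a, m):
    """Compute W = A . M for a length-N weight list `a` and an N x j matrix `m`."""
    if len(a) != len(m):
        raise ValueError("A needs one weight per row (training sample) of M")
    n_features = len(m[0])
    # Element f of W sums the sample weights times feature f over all samples.
    return [sum(a[i] * m[i][f] for i in range(len(a))) for f in range(n_features)]


# Toy example: N = 3 training samples, j = 2 features (made-up values).
A = [1.0, -1.0, 0.5]
M = [[1.0, 2.0],
     [3.0, 4.0],
     [2.0, 0.0]]
print(discriminative_weights(A, M))  # [-1.0, -2.0]
```

Features whose entries in W have a large absolute value are the ones that contribute most to separating the two classes.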
The feature discriminative weight vector with 1350 features (cf. section Results) is depicted in Fig. 2, in which darker spots represent stronger discriminative power than lighter spots. From Fig. 2 we can see that the top three discriminative features are DAC(2, 3), DCC(2, 8, 3) and DCC(2, 15, 1). All three features are derived from the same property, F-tilt (μ = 2), which suggests the importance of this property. The top ten discriminative features are listed in Table 2, from which several conclusions can be drawn. First, the correlation between the property F-roll (μ = 1) and several other properties shows strong discriminative power for identifying recombination hot/cold spots. Second, the correlation between F-tilt (μ = 2) and other properties, including itself, also shows strong discriminative power. Third, the features for which the distance between two dinucleotides equals 1, 2, 3 or 5 are important for identifying hot/cold spots.
Comparison with other related predictors
Two methods for hot/cold spot identification, IDQD1 and iRSpot-PseDNC16, are compared with the proposed methods iRSpot-DACC and iRSpot-DACC-PCA. The results of the various methods on the benchmark dataset S are shown in Table 1.
According to Table 1, iRSpot-DACC outperforms both IDQD1 and iRSpot-PseDNC16. Furthermore, iRSpot-DACC-PCA outperforms iRSpot-DACC by adopting Principal Component Analysis (PCA). The main reasons are as follows: IDQD1 considers only the local sequence-order information, and iRSpot-PseDNC16 improves on it by incorporating global sequence-order information. In contrast, iRSpot-DACC not only incorporates the global sequence-order information but also includes more DNA properties in the feature vectors. Therefore, we conclude that iRSpot-DACC would be a useful tool for hot/cold spot identification.
Discussion
In this study, we propose a computational method called iRSpot-DACC for yeast hot/cold spot identification. The method incorporates long-range, or global, sequence-order information. The results show that iRSpot-DACC outperforms other state-of-the-art predictors. Furthermore, iRSpot-DACC incorporates the correlations between different dinucleotide DNA properties. Another important advantage of our approach derives from PCA (principal component analysis)21, which not only improves the predictive accuracy but also reduces the computational cost. It can be expected that DACC would be a powerful feature extraction method that can be applied to other tasks in the field of bioinformatics, such as DNA-binding protein identification22, protein fold prediction23,24, cytokine detection25,26, protein-protein interaction site prediction27, tumor classification and analysis28, etc. Moreover, since publicly accessible web-servers help in developing more useful predictors, we will make efforts in our future work to develop a web-server for the method proposed in this paper. Furthermore, we will apply other advanced machine learning techniques, such as deep learning and neural networks29,30,31,32, to establish more accurate predictors for hot spot identification.
Material and Methods
Benchmark Dataset
The benchmark dataset used in this study was constructed by Jiang et al.13 and contains 490 hotspots and 591 coldspots. For more detailed information on this benchmark dataset, please refer to ref. 13.
Therefore, the benchmark dataset for the current study can be expressed as:
$$S = S^{+} \cup S^{-} \tag{2}$$
where S+ is the set of recombination hotspots, S− is the set of recombination coldspots, and ∪ is the mathematical operator representing “union”.
Dinucleotide-based auto-cross covariance (DACC)
As described above, the global sequence-order information shows strong discriminative power for identifying recombination hot/cold spots. Therefore, it is crucial to incorporate the global sequence-order information into our model. To this end, a feature called Dinucleotide-based auto-cross covariance (DACC)18 is adopted, which incorporates global sequence-order information along DNA sequences. DACC is the combination of Dinucleotide-based auto covariance (DAC) and Dinucleotide-based cross covariance (DCC). Next, we introduce DAC and DCC in turn.
Given a DNA sequence D
$$D = R_1 R_2 R_3 \cdots R_L \tag{3}$$
where L is the length of the DNA sequence, R1 is the nucleic acid residue at the first position in the sequence, R2 is the nucleic acid residue at the second position, and so forth.
The DAC18,33,34 represents the correlation of one DNA local property between two dinucleotides at a distance of lag in the sequence. DAC can be calculated by:
and
where μ is the index of the dinucleotide local property; L is the length of the DNA sequence; Pμ(RiRi+1) is the value of the dinucleotide RiRi+1 at position i for the local property index μ; and P̄μ is the average value of Pμ(RiRi+1) over the DNA sequence, which can be calculated as:
In this way, the DAC feature vector has length N × LAG, where N is the number of dinucleotide properties used in this study and LAG is the maximum value of lag.
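The DAC computation can be sketched in a few lines of Python. The sketch assumes the auto covariance form commonly given for this feature in the Pse-in-One documentation18, which uses the same symbols as the definitions above:

$$\mathrm{DAC}(\mu, lag) = \frac{1}{L - lag - 1}\sum_{i=1}^{L-lag-1}\left(P_\mu(R_iR_{i+1}) - \bar{P}_\mu\right)\left(P_\mu(R_{i+lag}R_{i+lag+1}) - \bar{P}_\mu\right), \qquad \bar{P}_\mu = \frac{1}{L-1}\sum_{j=1}^{L-1} P_\mu(R_jR_{j+1})$$

The helper names (`dinucleotides`, `dac`, `dac_vector`) and the toy property table `TOY_GC` are hypothetical; the fifteen real property values are those of Table 3 and are not reproduced here.

```python
from itertools import product


def dinucleotides(seq):
    """Return the overlapping dinucleotides R_i R_{i+1} of a DNA sequence."""
    return [seq[i:i + 2] for i in range(len(seq) - 1)]


def dac(seq, prop, lag):
    """Auto covariance of one dinucleotide property at distance `lag`."""
    values = [prop[d] for d in dinucleotides(seq)]  # P_mu(R_i R_{i+1}), L - 1 values
    n_pairs = len(values) - lag                     # equals L - lag - 1
    if n_pairs <= 0:
        raise ValueError("sequence is too short for this lag")
    mean = sum(values) / len(values)                # average over the whole sequence
    return sum((values[i] - mean) * (values[i + lag] - mean)
               for i in range(n_pairs)) / n_pairs


def dac_vector(seq, properties, max_lag):
    """Concatenate DAC values over all properties and lags 1..max_lag."""
    return [dac(seq, p, lag) for p in properties for lag in range(1, max_lag + 1)]


# Hypothetical toy property (NOT taken from Table 3): number of G/C bases.
TOY_GC = {a + b: float((a in "GC") + (b in "GC")) for a, b in product("ACGT", repeat=2)}

print(dac("AACCAACC", TOY_GC, 2))
print(len(dac_vector("AACCAACC", [TOY_GC] * 15, 6)))  # 90
```

Because every sequence yields N × LAG values regardless of its length L, sequences of different lengths are mapped to feature vectors of the same size.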
The DCC33,34,35 measures the correlation of two different properties between two dinucleotides separated by lag nucleic acid residues in the DNA sequence. DCC can be calculated by using the following equation:
and
where μ1 and μ2 are two different property indices; L is the length of the DNA sequence; Pμ1(RiRi+1) (Pμ2(RiRi+1)) is the numerical value of the dinucleotide RiRi+1 at position i for the property index μ1 (μ2); and P̄μ1 (P̄μ2) is the average value for the property index μ1 (μ2) along the whole sequence, which has the same form as Eq. (6). In this way, the DCC feature vector has length N × (N − 1) × LAG, where N is the number of dinucleotide properties used in this study and LAG is the maximum value of lag. The processes for generating the feature vectors of DAC and DCC are presented in Fig. 3(a,b), respectively.
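A corresponding sketch for DCC is given below. It assumes the cross covariance form commonly given for this feature in the Pse-in-One documentation18:

$$\mathrm{DCC}(\mu_1, \mu_2, lag) = \frac{1}{L - lag - 1}\sum_{i=1}^{L-lag-1}\left(P_{\mu_1}(R_iR_{i+1}) - \bar{P}_{\mu_1}\right)\left(P_{\mu_2}(R_{i+lag}R_{i+lag+1}) - \bar{P}_{\mu_2}\right)$$

The function names, the ordering of the output vector and the two toy property tables are hypothetical and serve only as an illustration; they are not the property values of Table 3.

```python
from itertools import product


def dcc(seq, prop1, prop2, lag):
    """Cross covariance between two dinucleotide properties at distance `lag`."""
    dinucs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    v1 = [prop1[d] for d in dinucs]   # P_mu1(R_i R_{i+1})
    v2 = [prop2[d] for d in dinucs]   # P_mu2(R_i R_{i+1})
    n_pairs = len(dinucs) - lag       # equals L - lag - 1
    if n_pairs <= 0:
        raise ValueError("sequence is too short for this lag")
    m1 = sum(v1) / len(v1)
    m2 = sum(v2) / len(v2)
    return sum((v1[i] - m1) * (v2[i + lag] - m2) for i in range(n_pairs)) / n_pairs


def dcc_vector(seq, properties, max_lag):
    """Concatenate DCC values over all ordered pairs of distinct properties."""
    return [dcc(seq, p1, p2, lag)
            for i, p1 in enumerate(properties)
            for j, p2 in enumerate(properties) if i != j
            for lag in range(1, max_lag + 1)]


# Hypothetical toy properties (NOT taken from Table 3).
PAIRS = [a + b for a, b in product("ACGT", repeat=2)]
TOY_GC = {d: float(sum(base in "GC" for base in d)) for d in PAIRS}
TOY_PURINE = {d: float(sum(base in "AG" for base in d)) for d in PAIRS}

print(dcc("AACCAACC", TOY_GC, TOY_PURINE, 2))
print(len(dcc_vector("AACCAACC", [TOY_GC] * 15, 6)))  # 1260
```

Concatenating the DAC and DCC vectors of a sequence gives its DACC feature vector.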
In this study, fifteen properties from ref. 36 are used. Their values are listed in Table 3.
Support vector machine (SVM)
Support Vector Machine (SVM) is a pattern recognition technique introduced by Vapnik37, which has been employed for many computational tasks in bioinformatics38,39,40,41. It seeks an optimal hyperplane via transforming the original feature space into a high dimensional vector space to achieve classification.
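In the current study, the ANACONDA package (http://www.continuum.io/), which contains an implementation of SVM, is adopted. The selected kernel function is the radial basis function (RBF), which is defined as:
$$K(x_i, x_j) = \exp\left(-\gamma \lVert x_i - x_j \rVert^{2}\right) \tag{9}$$
where x_i and x_j are two feature vectors.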
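For illustration, the standard RBF kernel, K(x, y) = exp(−γ‖x − y‖²), can be written as a small Python function. This is a minimal sketch and not the implementation used in the study; the name `rbf_kernel` and the toy vectors are hypothetical.

```python
import math


def rbf_kernel(x, y, gamma):
    """Standard RBF kernel: exp(-gamma * squared Euclidean distance)."""
    if len(x) != len(y):
        raise ValueError("feature vectors must have the same length")
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


# Identical vectors give the maximum similarity of 1.0; the value decays
# towards 0 as the vectors move apart, faster for larger gamma.
print(rbf_kernel([0.2, 0.4], [0.2, 0.4], gamma=0.5))  # 1.0
print(rbf_kernel([1.0, 0.0], [0.0, 0.0], gamma=1.0))
```

A larger γ makes the kernel more local, which is why γ is usually tuned together with the regularization parameter C.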
Two parameters, the regularization parameter C and the kernel width parameter γ, are optimized on the dataset by using a grid tool provided by ANACONDA. In the current study, the values of the two parameters are shown below:
Principal Component Analysis (PCA)
Feature selection can remove noise and thereby improve classification performance42. To reduce redundant information, we adopt Principal Component Analysis (PCA)19 to reduce the dimension of the original feature vectors. PCA projects the feature space onto a smaller subspace that still represents the dataset well.
Suppose the original feature space of iRSpot-DACC can be represented as:
where N is the number of training samples and k is the dimension of the feature vectors. Then, the average of every dimension of X can be expressed as:
where N and k have the same meaning as in Eq. (11). Therefore, the matrix composed of the mean vectors for every dimension of X can be represented as:
where eij represents the element of X and can be acquired from Eq. (12).
Then, the covariance matrix and its eigenvalues can be calculated and the eigenvalues can be represented as:
Next, the l eigenvectors with the largest eigenvalues are chosen to form a matrix, which can be represented as:
where each column represents an eigenvector and their corresponding eigenvalues can be represented as:
where . Finally, the new subspace M can be calculated by
Therefore, the dimension of the feature space is reduced from k to l. The values of k and l have been discussed in section Results.
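For reference, the PCA steps described in this section can be summarized in one standard formulation that is consistent with the symbols X, N, k, l and M used above. The element notation x_ij, the covariance matrix C, the eigenvectors v_i, the matrix name V, the centring of X before projection and the 1/(N − 1) normalization are assumptions of this summary.

```latex
\begin{aligned}
X &= (x_{ij}) \in \mathbb{R}^{N \times k}, \\
\bar{x}_j &= \frac{1}{N}\sum_{i=1}^{N} x_{ij}, \qquad j = 1, \dots, k, \\
C &= \frac{1}{N-1}\,(X - \bar{X})^{\mathsf{T}}(X - \bar{X}), \\
C\,v_i &= \lambda_i v_i, \qquad \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_k, \\
V &= [\,v_1, v_2, \dots, v_l\,], \\
M &= (X - \bar{X})\,V \in \mathbb{R}^{N \times l}.
\end{aligned}
```

Here every row of the N × k matrix X̄ equals the mean vector (x̄_1, …, x̄_k), so that X − X̄ is the centred data matrix.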
The selection of principal components is based on the cumulative weight ratio w:
The values of w and l have been discussed in section Results.
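The choice of l from the cumulative weight ratio can be sketched as follows. The sketch assumes that w is the usual cumulative ratio of the l largest eigenvalues to the sum of all k eigenvalues,

$$w = \frac{\sum_{i=1}^{l}\lambda_i}{\sum_{i=1}^{k}\lambda_i},$$

and that l is taken as the smallest number of components whose ratio reaches the chosen threshold. The function name `select_components` and the toy eigenvalues are hypothetical.

```python
def select_components(eigenvalues, w):
    """Return the smallest l whose top-l eigenvalues reach a cumulative ratio of w."""
    if not 0 < w <= 1:
        raise ValueError("w must lie in (0, 1]")
    ordered = sorted(eigenvalues, reverse=True)  # largest eigenvalues first
    total = sum(ordered)
    if total <= 0:
        raise ValueError("eigenvalues must have a positive sum")
    cumulative = 0.0
    for l, value in enumerate(ordered, start=1):
        cumulative += value
        if cumulative / total >= w:
            return l
    return len(ordered)  # guard against rounding when w is very close to 1


# Toy eigenvalues (made-up numbers): the two largest cover 70% of the total.
print(select_components([1.0, 4.0, 2.0, 3.0], w=0.6))   # 2
print(select_components([4.0, 3.0, 2.0, 1.0], w=0.99))  # 4
```

With a threshold close to 1, as with w = 0.99 in this study, only directions that carry almost no variance are discarded.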
Jackknife test
In statistical prediction, three validation methods, namely the independent dataset test, the sub-sampling (or K-fold cross-validation) test and the jackknife test, are often used to measure the performance of a predictor43,44,45. Among the three, the jackknife test is deemed the most objective and has therefore been widely adopted by researchers to evaluate the performance of various classifiers. Therefore, in the current study, the jackknife test is also adopted to measure the performance of iRSpot-DACC and iRSpot-DACC-PCA. In the jackknife test, each sequence in the benchmark dataset is selected in turn as the test sample, and the remaining sequences are used as training samples.
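The jackknife procedure can be sketched as a simple loop. To keep the example self-contained, a nearest-centroid rule is used as a stand-in classifier; it is not the SVM used in this study, and the function names and toy data are hypothetical.

```python
def jackknife(samples, labels, train_and_predict):
    """Leave each sample out once, train on the rest and predict the held-out one."""
    predictions = []
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        predictions.append(train_and_predict(train_x, train_y, samples[i]))
    return predictions


def nearest_centroid(train_x, train_y, x):
    """Stand-in classifier (NOT an SVM): assign x to the class with the closest mean."""
    centroids = {}
    for label in set(train_y):
        rows = [row for row, y in zip(train_x, train_y) if y == label]
        centroids[label] = [sum(column) / len(rows) for column in zip(*rows)]
    return min(centroids,
               key=lambda lab: sum((a - b) ** 2 for a, b in zip(x, centroids[lab])))


# Toy one-dimensional data (made-up): label 1 = hotspot, label 0 = coldspot.
X = [[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]]
y = [0, 0, 0, 1, 1, 1]
print(jackknife(X, y, nearest_centroid))  # [0, 0, 0, 1, 1, 1]
```

Each of the N samples is predicted by a model that never saw it during training, so the result does not depend on a random split of the data.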
Criteria for performance evaluation
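Sensitivity (Se), Specificity (Sp), Accuracy (Acc) and the Matthews correlation coefficient (Mcc)46 are used to evaluate the performance of different methods. They are defined as follows:
$$\begin{aligned} Se &= \frac{TP}{TP + FN}, \\ Sp &= \frac{TN}{TN + FP}, \\ Acc &= \frac{TP + TN}{TP + FP + TN + FN}, \\ Mcc &= \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \end{aligned}$$
where TP, FP, TN and FN represent the numbers of true positives, false positives, true negatives and false negatives, respectively.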
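These four measures can be computed from the confusion-matrix counts with the standard definitions Se = TP/(TP + FN), Sp = TN/(TN + FP), Acc = (TP + TN)/(TP + FP + TN + FN) and Mcc = (TP × TN − FP × FN)/√((TP + FP)(TP + FN)(TN + FP)(TN + FN)). The following sketch uses a hypothetical function name `evaluate` and made-up counts that are not results of this study.

```python
import math


def evaluate(tp, fp, tn, fn):
    """Return Se, Sp, Acc and Mcc computed from confusion-matrix counts."""
    se = tp / (tp + fn)                    # fraction of hotspots found
    sp = tn / (tn + fp)                    # fraction of coldspots found
    acc = (tp + tn) / (tp + fp + tn + fn)  # overall fraction correct
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Mcc is reported as 0 when the denominator vanishes (a common convention).
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {"Se": se, "Sp": sp, "Acc": acc, "Mcc": mcc}


# Made-up counts for illustration only.
print(evaluate(tp=40, fp=10, tn=45, fn=5))
```

Because the benchmark dataset contains different numbers of hotspots and coldspots, Mcc is a useful complement to Acc, as it takes all four counts into account.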
Additional Information
How to cite this article: Liu, B. et al. iRSpot-DACC: a computational predictor for recombination hot/cold spots identification based on dinucleotide-based auto-cross covariance. Sci. Rep. 6, 33483; doi: 10.1038/srep33483 (2016).
References
1. Liu, G., Liu, J., Cui, X. & Cai, L. Sequence-dependent prediction of recombination hotspots in Saccharomyces cerevisiae. Journal of theoretical biology 293, 49–54 (2012).
2. Lynn, A., Ashley, T. & Hassold, T. Variation in human meiotic recombination. Annu. Rev. Genomics Hum. Genet. 5, 317–349 (2004).
3. Lewin, B. Genes VIII. 8th. 428–456 (New Jersey: Pearson/Prentice-Hall, Upper Saddle River, 2004).
4. Spencer, C. C. et al. The influence of recombination on human genetic diversity. PLoS Genet 2, e148 (2006).
5. Galtier, N., Piganeau, G., Mouchiroud, D. & Duret, L. GC-Content Evolution in Mammalian Genomes: The Biased Gene Conversion Hypothesis. Genetics 159, 907–911 (2001).
6. Lercher, M. J. & Hurst, L. D. Human SNP variability and mutation rate are higher in regions of high recombination. Trends in genetics 18, 337–340 (2002).
7. Baudat, F. & Nicolas, A. Clustering of meiotic double-strand breaks on yeast chromosome III. Proceedings of the National Academy of Sciences 94, 5213–5218 (1997).
8. Klein, S. et al. Patterns of meiotic double-strand breakage on native and artificial yeast chromosomes. Chromosoma 105, 276–284 (1996).
9. Liu, B., Wang, S., Long, R. & Chou, K.-C. iRSpot-EL: identify recombination spots with an ensemble learning approach. Bioinformatics, doi: 10.1093/bioinformatics/btw539 (2016).
10. Mancera, E., Bourgon, R., Brozzi, A., Huber, W. & Steinmetz, L. M. High-resolution mapping of meiotic crossovers and non-crossovers in yeast. Nature 454, 479–485 (2008).
11. Gerton, J. L. et al. Global mapping of meiotic recombination hotspots and coldspots in the yeast Saccharomyces cerevisiae. Proceedings of the National Academy of Sciences 97, 11383–11390 (2000).
12. Zhou, T., Weng, J., Sun, X. & Lu, Z. Support vector machine for classification of meiotic recombination hotspots and coldspots in Saccharomyces cerevisiae based on codon composition. BMC Bioinformatics 7, 223 (2006).
13. Jiang, P. et al. RF-DYMHC: detecting the yeast meiotic recombination hotspots and coldspots by random forest model using gapped dinucleotide composition features. Nucleic Acids Research 35, W47–W51 (2007).
14. Guo, S.-H., Xu, L.-Q., Chen, W., Liu, G.-Q. & Lin, H. Recombination spots prediction using DNA physical properties in the saccharomyces cerevisiae genome. AIP Conference Proceedings 1479, 1556–1559 (2012).
15. Wu, M., Kwoh, C. K., Przytycka, T. M., Li, J. & Zheng, J. In Proceedings of the ACM Conference on Bioinformatics, Computational Biology and Biomedicine 297–304 (ACM, Orlando, Florida, 2012).
16. Chen, W., Feng, P.-M., Lin, H. & Chou, K.-C. iRSpot-PseDNC: identify recombination spots with pseudo dinucleotide composition. Nucleic acids research, gks1450 (2013).
17. Wang, R., Xu, Y. & Liu, B. Recombination spot identification Based on gapped k-mers. Scientific reports 6 (2016).
18. Liu, B. et al. Pse-in-One: a web server for generating various modes of pseudo components of DNA, RNA, and protein sequences. Nucleic acids research 43, W65–W71 (2015).
19. Liu, B., Chen, J. & Wang, X. Protein remote homology detection by combining Chou’s distance-pair pseudo amino acid composition and principal component analysis. Molecular Genetics and Genomics 290, 1919–1931 (2015).
20. Liu, B. et al. Identification of microRNA precursor with the degenerate K-tuple or Kmer strategy. Journal of theoretical biology 385, 153–159 (2015).
21. Pearson, K. On lines and planes of closest fit to systems of points in space. Philosophical Magazine 2, 559–572 (1901).
22. Song, L. et al. nDNA-prot: identification of DNA-binding proteins based on unbalanced classification. BMC bioinformatics 15, 1 (2014).
23. Wei, L., Liao, M., Gao, X. & Zou, Q. Enhanced protein fold prediction method through a novel feature extraction technique. IEEE transactions on nanobioscience 14, 649–659 (2015).
24. Zhao, X., Zou, Q., Liu, B. & Liu, X. Exploratory predicting protein folding model with random forest and hybrid features. Current Proteomics 11, 289–299 (2014).
25. Zou, Q. et al. An approach for identifying cytokines based on a novel ensemble classifier. BioMed research international 2013 (2013).
26. Zeng, X., Yuan, S., Huang, X. & Zou, Q. Identification of cytokine via an improved genetic algorithm. Frontiers of Computer Science 9, 643–651 (2015).
27. Wang, B. et al. Predicting protein interaction sites from residue spatial sequence profile and evolution rate. FEBS letters 580, 380–384 (2006).
28. Huang, D.-S. & Zheng, C.-H. Independent component analysis-based penalized discriminant method for tumor classification using gene expression data. Bioinformatics 22, 1855–1862 (2006).
29. Huang, D.-S. Radial basis probabilistic neural networks: model and application. International Journal of Pattern Recognition and Artificial Intelligence 13, 1083–1101 (1999).
30. Huang, D.-S. A constructive approach for finding arbitrary roots of polynomials by neural networks. IEEE Transactions on Neural Networks 15, 477–491 (2004).
31. Huang, D.-S. & Du, J.-X. A constructive hybrid structure optimization methodology for radial basis probabilistic neural networks. IEEE Transactions on Neural Networks 19, 2099–2115 (2008).
32. Zhang, J.-R., Zhang, J., Lok, T.-M. & Lyu, M. R. A hybrid particle swarm optimization–back-propagation algorithm for feedforward neural network training. Applied Mathematics and Computation 185, 1026–1037 (2007).
33. Dong, Q., Zhou, S. & Guan, J. A new taxonomy-based protein fold recognition approach based on autocross-covariance transformation. Bioinformatics 25, 2655–2662 (2009).
34. Chen, W. et al. PseKNC-General: A cross-platform package for generating various modes of pseudo nucleotide compositions. Bioinformatics, doi: 10.1093/bioinformatics/btu602 (2014).
35. Chen, W., Lei, T. Y., Jin, D. C., Lin, H. & Chou, K. C. PseKNC: a flexible web server for generating pseudo K-tuple nucleotide composition. Analytical biochemistry 456, 53–60 (2014).
36. Liu, G., Xing, Y. & Cai, L. Using weighted features to predict recombination hotspots in Saccharomyces cerevisiae. Journal of theoretical biology 382, 15–22 (2015).
37. Vapnik, V. N. Statistical learning theory. Vol. 1 (Wiley: New York, 1998).
38. Liu, B., Wang, S., Dong, Q., Li, S. & Liu, X. Identification of DNA-binding proteins by combining auto-cross covariance transformation and ensemble learning. IEEE Transactions on NanoBioscience, doi: 10.1109/TNB.2016.2555951 (2016).
39. Zou, Q., Mao, Y., Hu, L., Wu, Y. & Ji, Z. miRClassify: an advanced web server for miRNA family classification and annotation. Comput Biol Med 45, 157–160 (2014).
40. Dapeng, L., Ying, J. & Quan, Z. Protein Folds Prediction with Hierarchical Structured SVM. Current Proteomics 13, 79–85 (2016).
41. Chen, W. & Lin, H. Prediction of midbody, centrosome and kinetochore proteins based on gene ontology information. Biochemical and biophysical research communications 401, 382–384 (2010).
42. Zou, Q., Zeng, J., Cao, L. & Ji, R. A novel features ranking metric with application to scalable visual and bioinformatics data classification. Neurocomputing 173, 346–354 (2016).
43. Chen, W., Tran, H., Liang, Z., Lin, H. & Zhang, L. Identification and analysis of the N(6)-methyladenosine in the Saccharomyces cerevisiae transcriptome. Sci Rep 5, 13859 (2015).
44. Liu, B., Fang, L., Long, R., Lan, X. & Chou, K.-C. iEnhancer-2L: a two-layer predictor for identifying enhancers and their strength by pseudo k-tuple nucleotide composition. Bioinformatics 32, 362–369 (2016).
45. Chen, W., Feng, P., Ding, H., Lin, H. & Chou, K.-C. iRNA-methyl: identifying N 6-methyladenosine sites using pseudo nucleotide composition. Analytical biochemistry 490, 26–33 (2015).
46. Chen, J., Wang, X. & Liu, B. iMiRNA-SSF: improving the identification of MicroRNA precursors by combining negative sets with different distributions. Scientific reports 6 (2016).
47. Chang, C.-C. & Lin, C.-J. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST) 2, 27 (2011).
Acknowledgements
This work was supported by the National High Technology Research and Development Program of China (863 Program) [2015AA015405], the National Natural Science Foundation of China (No. 61300112, 61672184, 61573118 and 61272383, 61572151), the Natural Science Foundation of Guangdong Province (2014A030313695), Guangdong Natural Science Funds for Distinguished Young Scholars (2016A030306008), and Scientific Research Foundation in Shenzhen (Grant No. JCYJ20150626110425228).
Author information
Contributions
B.L. conceived of the study and designed the experiments, participated in designing the study, drafting the manuscript and performing the statistical analysis. Y.L. participated in coding the experiments and drafting the manuscript. B.Q.L., X.P.J. and X.L.W. participated in performing the statistical analysis. All authors read and approved the final manuscript.
Ethics declarations
Competing interests
The authors declare no competing financial interests.
Rights and permissions
This work is licensed under a Creative Commons Attribution 4.0 International License. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in the credit line; if the material is not included under the Creative Commons license, users will need to obtain permission from the license holder to reproduce the material. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/
About this article
Cite this article
Liu, B., Liu, Y., Jin, X. et al. iRSpot-DACC: a computational predictor for recombination hot/cold spots identification based on dinucleotide-based auto-cross covariance. Sci Rep 6, 33483 (2016). https://doi.org/10.1038/srep33483