Summary
This chapter discusses dataset characteristics that play a crucial role in many metalearning systems. Typically, they help to restrict the search in a given configuration space. The basic characteristic of the target variable, for instance, determines the choice of approach: if it is numeric, a suitable regression algorithm should be used, while if it is categorical, a classification algorithm should be used instead. This chapter provides an overview of different types of dataset characteristics, which are sometimes also referred to as metafeatures. These include so-called simple, statistical, information-theoretic, model-based, complexity-based, and performance-based metafeatures. The last group has the advantage that it can be easily defined in any domain. It includes, for instance, sampling landmarkers, which represent the performance of particular algorithms on samples of the data, and relative landmarkers, which capture differences or ratios of performance values and thereby provide estimates of performance gains. The final part of the chapter discusses the specific dataset characteristics used in different machine learning tasks, including classification, regression, time series, and clustering.
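The metafeature families named above can be illustrated with a small sketch (the helper names below are hypothetical, not from the chapter): simple and information-theoretic measures computed directly from the data, a decision-stump landmarker, and a relative landmarker formed as a ratio against a majority-class baseline.

```python
import numpy as np

def simple_metafeatures(X, y):
    """Simple, statistical, and information-theoretic metafeatures."""
    n, p = X.shape
    _, counts = np.unique(y, return_counts=True)
    probs = counts / n
    return {
        "n_instances": n,
        "n_features": p,
        "n_classes": len(counts),
        # information-theoretic: entropy of the class distribution (bits)
        "class_entropy": float(-np.sum(probs * np.log2(probs))),
        # statistical: average per-feature skewness
        "mean_skewness": float(np.mean(
            ((X - X.mean(0)) ** 3).mean(0) / (X.std(0) ** 3 + 1e-12))),
    }

def majority_landmarker(y):
    """Baseline landmarker: accuracy of always predicting the majority class."""
    _, counts = np.unique(y, return_counts=True)
    return counts.max() / len(y)

def stump_landmarker(X, y):
    """Landmarker: training accuracy of the best median-threshold stump."""
    best = 0.0
    for j in range(X.shape[1]):
        mask = X[:, j] <= np.median(X[:, j])
        pred = np.empty_like(y)
        for m in (mask, ~mask):  # predict the majority class on each side
            if m.any():
                vals, cnts = np.unique(y[m], return_counts=True)
                pred[m] = vals[np.argmax(cnts)]
        best = max(best, float(np.mean(pred == y)))
    return best

# Toy dataset: class determined by the sign of the first feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] > 0).astype(int)

mf = simple_metafeatures(X, y)
# Relative landmarker in ratio form: stump performance vs. the baseline.
rel = stump_landmarker(X, y) / majority_landmarker(y)
```

A value of `rel` well above 1 signals that even a trivial model exploits structure in the data, which is exactly the kind of cheap, domain-independent signal performance-based characterizations provide.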
Rights and permissions
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Copyright information
© 2022 The Author(s)
About this chapter
Cite this chapter
Brazdil, P., van Rijn, J.N., Soares, C., Vanschoren, J. (2022). Dataset Characteristics (Metafeatures). In: Metalearning. Cognitive Technologies. Springer, Cham. https://doi.org/10.1007/978-3-030-67024-5_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-67023-8
Online ISBN: 978-3-030-67024-5