Summary
This chapter reviews online repositories where researchers can share data, code, and experiments. In particular, it covers OpenML, an online platform for sharing and automatically organizing machine learning datasets, algorithms, and experiments. OpenML contains thousands of datasets and algorithms, and millions of experimental results. We describe its underlying philosophy and its core components: datasets, tasks, flows, setups, runs, and benchmark suites. OpenML provides API bindings for various programming languages, making it easy for users to interact with the platform in their native language. One important feature of OpenML is its integration into various machine learning toolboxes, such as Scikit-learn, Weka, and mlr. Users of these toolboxes can automatically upload all their results, contributing to a large repository of experimental results.
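The relationships among these components can be made concrete with a small sketch. The classes and field names below are illustrative only, not OpenML's actual schema or the API of the openml-python package; the example ids mirror the kind of identifiers the platform assigns:

```python
from dataclasses import dataclass, field

# Illustrative sketch of OpenML's core entities. Names and fields are
# hypothetical, chosen to show how the components relate to one another.

@dataclass
class Dataset:
    did: int                   # dataset id on the platform
    name: str

@dataclass
class Task:
    tid: int                   # task id
    dataset: Dataset           # a task is defined on top of a dataset
    task_type: str             # e.g. "Supervised Classification"
    estimation_procedure: str  # e.g. "10-fold Crossvalidation"

@dataclass
class Flow:
    fid: int                   # flow id
    name: str                  # identifies an algorithm or pipeline

@dataclass
class Setup:
    flow: Flow
    hyperparameters: dict      # a flow plus concrete hyperparameter values

@dataclass
class Run:
    task: Task                 # applying a setup to a task produces a run
    setup: Setup
    evaluations: dict = field(default_factory=dict)

@dataclass
class BenchmarkSuite:
    name: str
    tasks: list                # a curated collection of tasks

# Wiring the pieces together for a single experiment:
iris = Dataset(did=61, name="iris")
task = Task(tid=59, dataset=iris, task_type="Supervised Classification",
            estimation_procedure="10-fold Crossvalidation")
flow = Flow(fid=1, name="sklearn.tree.DecisionTreeClassifier")
setup = Setup(flow=flow, hyperparameters={"max_depth": 3})
run = Run(task=task, setup=setup,
          evaluations={"predictive_accuracy": 0.95})

print(run.task.dataset.name)   # the dataset is reachable through the run's task
```

This separation is what makes results comparable across toolboxes: two runs that reference the same task id were evaluated on the same data with the same estimation procedure, regardless of which library produced them.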
Rights and permissions
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
Copyright information
© 2022 The Author(s)
Cite this chapter
Brazdil, P., van Rijn, J.N., Soares, C., Vanschoren, J. (2022). Metadata Repositories. In: Metalearning. Cognitive Technologies. Springer, Cham. https://doi.org/10.1007/978-3-030-67024-5_16
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-67023-8
Online ISBN: 978-3-030-67024-5
eBook Packages: Computer Science (R0)