Summary
This chapter reviews online repositories where researchers can share data, code, and experiments. In particular, it covers OpenML, an online platform for sharing and automatically organizing machine learning datasets, algorithms, and experiments. OpenML contains thousands of datasets and algorithms, and millions of experimental results. We describe its underlying philosophy and its core components: datasets, tasks, flows, setups, runs, and benchmark suites. OpenML provides API bindings for various programming languages, making it easy for users to interact with the platform in their native language. One important feature of OpenML is its integration into various machine learning toolboxes, such as Scikit-learn, Weka, and mlr. Users of these toolboxes can automatically upload all their results, contributing to a large repository of experimental results.
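The relationships among these components can be made concrete with a small sketch. The classes and field names below are illustrative only, not OpenML's actual schema or the API of the openml-python package; the example ids mirror the kind of identifiers the platform assigns:

```python
from dataclasses import dataclass, field

# Illustrative sketch of OpenML's core entities. Names and fields are
# hypothetical, chosen to show how the components relate to one another.

@dataclass
class Dataset:
    did: int                   # dataset id on the platform
    name: str

@dataclass
class Task:
    tid: int                   # task id
    dataset: Dataset           # a task is defined on top of a dataset
    task_type: str             # e.g. "Supervised Classification"
    estimation_procedure: str  # e.g. "10-fold Crossvalidation"

@dataclass
class Flow:
    fid: int                   # flow id
    name: str                  # identifies an algorithm or pipeline

@dataclass
class Setup:
    flow: Flow
    hyperparameters: dict      # a flow plus concrete hyperparameter values

@dataclass
class Run:
    task: Task                 # applying a setup to a task produces a run
    setup: Setup
    evaluations: dict = field(default_factory=dict)

@dataclass
class BenchmarkSuite:
    name: str
    tasks: list                # a curated collection of tasks

# Wiring the pieces together for a single experiment:
iris = Dataset(did=61, name="iris")
task = Task(tid=59, dataset=iris, task_type="Supervised Classification",
            estimation_procedure="10-fold Crossvalidation")
flow = Flow(fid=1, name="sklearn.tree.DecisionTreeClassifier")
setup = Setup(flow=flow, hyperparameters={"max_depth": 3})
run = Run(task=task, setup=setup,
          evaluations={"predictive_accuracy": 0.95})

print(run.task.dataset.name)   # the dataset is reachable through the run's task
```

This separation is what makes results comparable across toolboxes: two runs that reference the same task id were evaluated on the same data with the same estimation procedure, regardless of which library produced them.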
Rights and permissions
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
Copyright information
© 2022 The Author(s)
Cite this chapter
Brazdil, P., van Rijn, J.N., Soares, C., Vanschoren, J. (2022). Metadata Repositories. In: Metalearning. Cognitive Technologies. Springer, Cham. https://doi.org/10.1007/978-3-030-67024-5_16
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-67023-8
Online ISBN: 978-3-030-67024-5
eBook Packages: Computer Science (R0)