Abstract
It has been observed that, in data science, much of the effort typically goes into preparatory steps that precede model building. This chapter focuses on some of these steps. A comprehensive description of the task to be solved is usually supplied by a domain expert. Techniques exist that can process such natural language descriptions to obtain task descriptors (e.g., keywords) and to determine the task type, the domain, and the goals. These, in turn, can be used to search for domain-specific knowledge appropriate to the given task. In some situations the required data may not be available, and a plan must be elaborated for how to obtain it. Although little research has been done in this area so far, we expect progress to be made in the future. In contrast, the area of preprocessing and transformation has been explored by various researchers. Methods exist for the selection of instances, the elimination of outliers, discretization, and other kinds of transformations; this area is sometimes referred to as data wrangling. Such transformations can be learned by exploiting existing machine learning techniques (e.g., learning by demonstration). The final part of the chapter discusses decisions regarding the appropriate level of detail (granularity) to be used in a given task. Although further progress in this area is foreseeable, more work is needed to determine how to achieve it effectively.
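As a minimal illustration of the kinds of preprocessing transformations mentioned above (not taken from the chapter itself), the sketch below combines two common steps: outlier elimination using a median-based robust score, followed by equal-width discretization. The function names and the modified z-score threshold of 3.5 are illustrative assumptions, not part of the original text.

```python
import statistics

def remove_outliers(values, thresh=3.5):
    """Drop values whose modified z-score (median/MAD-based) exceeds thresh.

    The median-based score is robust: a single extreme value does not
    inflate the scale estimate the way it would with mean and stdev.
    """
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    if mad == 0:
        return list(values)
    return [v for v in values if abs(0.6745 * (v - med) / mad) <= thresh]

def discretize(values, n_bins=3):
    """Equal-width binning: map each value to a bin index in [0, n_bins - 1]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # avoid zero width for constant data
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

data = [1.0, 1.2, 0.9, 1.1, 50.0]   # 50.0 is an obvious outlier
cleaned = remove_outliers(data)      # the outlier is filtered out
bins = discretize(cleaned)           # remaining values mapped to bin indices
```

In an automated pipeline, the choice and ordering of such transformations would itself be a search problem, which is the theme of this chapter.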
Rights and permissions
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Copyright information
© 2022 The Author(s)
Cite this chapter
Brazdil, P., van Rijn, J.N., Soares, C., Vanschoren, J. (2022). Automating Data Science. In: Metalearning. Cognitive Technologies. Springer, Cham. https://doi.org/10.1007/978-3-030-67024-5_14
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-67023-8
Online ISBN: 978-3-030-67024-5
eBook Packages: Computer Science (R0)