Abstract
It has been observed that, in data science, much of the effort typically goes into preparatory steps that precede model building. This chapter focuses on some of these steps. A comprehensive description of the task to be solved is usually supplied by a domain expert. Techniques exist that can process such natural language descriptions to obtain task descriptors (e.g., keywords) and to determine the task type, the domain, and the goals. These, in turn, can be used to search for domain-specific knowledge appropriate to the given task. In some situations the required data may not be available, and a plan must be elaborated for how to obtain it. Although little research has been done in this area so far, we expect progress to be made in the future. In contrast, the area of preprocessing and transformation has been explored by various researchers. Methods exist for the selection of instances, the elimination of outliers, discretization, and other kinds of transformations; this area is sometimes referred to as data wrangling. Such transformations can be learned by exploiting existing machine learning techniques (e.g., learning by demonstration). The final part of the chapter discusses decisions regarding the appropriate level of detail (granularity) to be used in a given task. Although further progress in this area is foreseeable, more work is needed to determine how to achieve it effectively.
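As a minimal illustration of the kinds of preprocessing transformations mentioned above (not taken from the chapter itself), the sketch below combines two common steps: outlier elimination using a median-based robust score, followed by equal-width discretization. The function names and the modified z-score threshold of 3.5 are illustrative assumptions, not part of the original text.

```python
import statistics

def remove_outliers(values, thresh=3.5):
    """Drop values whose modified z-score (median/MAD-based) exceeds thresh.

    The median-based score is robust: a single extreme value does not
    inflate the scale estimate the way it would with mean and stdev.
    """
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    if mad == 0:
        return list(values)
    return [v for v in values if abs(0.6745 * (v - med) / mad) <= thresh]

def discretize(values, n_bins=3):
    """Equal-width binning: map each value to a bin index in [0, n_bins - 1]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # avoid zero width for constant data
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

data = [1.0, 1.2, 0.9, 1.1, 50.0]   # 50.0 is an obvious outlier
cleaned = remove_outliers(data)      # the outlier is filtered out
bins = discretize(cleaned)           # remaining values mapped to bin indices
```

In an automated pipeline, the choice and ordering of such transformations would itself be a search problem, which is the theme of this chapter.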
Rights and permissions
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Copyright information
© 2022 The Author(s)
Cite this chapter
Brazdil, P., van Rijn, J.N., Soares, C., Vanschoren, J. (2022). Automating Data Science. In: Metalearning. Cognitive Technologies. Springer, Cham. https://doi.org/10.1007/978-3-030-67024-5_14
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-67023-8
Online ISBN: 978-3-030-67024-5
eBook Packages: Computer Science (R0)