Abstract
Data-driven modeling is an increasingly popular approach in energy systems modeling. In data-driven modeling, machine learning methods such as linear regression, neural networks or decision-tree-based methods are applied. While these methods do not require domain knowledge, they are sensitive to data quality. Therefore, improving the data quality of a dataset is beneficial for creating machine learning-based models. Data quality can be improved through preprocessing methods. One such type of preprocessing is feature engineering, which focuses on evaluating and improving the quality of individual features in the dataset. Feature engineering includes methods such as feature creation, feature expansion, or feature selection. In this work, a Python framework containing different feature engineering methods is presented. The framework provides methods for feature creation, expansion and selection; in addition, methods for transforming and filtering data are implemented. The implementation is based on the Python library scikit-learn. The framework is demonstrated on a use case from energy demand prediction, in which a data-driven model is created using selected feature engineering methods. The results show an improvement in prediction accuracy through the engineered features.
1 Introduction
Modeling and simulation is a crucial step in the design and optimization of energy systems. While traditional modeling methods rely on system parameters, a recent approach focuses on creating data-driven models from measurement data taken from a system. Data-driven models are mainly based on machine learning (ML) methods, which can be classified into white-box and black-box methods based on their explainability [15]. For instance, Molnar et al. [15] classify methods such as linear or logistic regression and decision trees as white-box ML methods, and methods such as neural networks or decision tree ensembles as black-box ML methods.
A white-box ML model provides information about the underlying system for instance through its input-output relations (interpretability) or through a humanly comprehensible structure (explainability) [14]. To keep the structure of the model comprehensible, explainable models focus on a reduced complexity. As a result, their capability of modeling complex dependencies is often limited, creating a trade-off between accuracy and explainability [2].
To capture more complex dependencies using white-box ML models, methods of feature engineering are applied. The main purpose of feature engineering is to augment the existing dataset through adding new information, or expanding or reducing the feature set. In addition, the quality of a single feature can be improved, for instance through transformation or filtering [9].
The area of feature engineering covers a wide range of methods, such as feature creation, feature expansion [5] or feature selection [3]. Feature creation includes encodings of time-based features, such as cyclic features [19], or categorical encoding [11]. Similarly, feature expansion creates new features based on existing ones, covering classical methods such as polynomial expansion [5] or spline interpolation [6]. In contrast to feature creation and expansion, feature selection aims to reduce the size of the feature set. While large feature sets may contain more information, high-dimensional feature sets may be subject to sparsity or multicollinearity. To address this, methods such as Principal Component Analysis (PCA) [10] reduce the feature set through transformation, while feature selection methods discard features. Feature selection can be implemented, for instance, through sequential methods such as forward or backward selection, or through correlation criteria [3], including measures based on the Pearson Correlation Coefficient as well as entropy-based criteria [1]. For correlation criteria, feature selection is mainly implemented through a threshold-based selection.
Mainly, the methods of feature engineering are applied during the first steps of creating a data-driven model, creating an engineered dataset to use for training [4]. However, feature engineering methods can also be used in combination with model selection procedures, such as grid search [1]. Feature engineering methods are widely used in applications from the energy domain, such as in prediction for building energy demand [20] or photovoltaic power prediction [4].
1.1 Main Contribution
To apply different feature engineering methods to the creation of data-driven models, a Python framework implementing these methods was developed. The framework is based on scikit-learn and can be imported as a Python package. Compared with existing frameworks, it focuses on providing a standardized function interface that allows different combined workflows to be created with low effort. The functionality of the framework is demonstrated on a case study of an energy demand prediction use case. For this use case, a multi-step workflow consisting of a combination of feature engineering methods is created. The results of the case study show an improvement in prediction accuracy through the applied feature engineering workflow.
2 Method
The presented framework implements various feature engineering methods in Python based on the framework scikit-learn. The framework implements methods for feature creation and expansion, feature selection, as well as transformation and filtering operations. The source code of the framework is openly available at https://github.com/tug-cps/featureengineering.
Feature Creation and Expansion
In the framework, different methods for feature creation and expansion are implemented. These methods create new features from time values or from expansion of existing features. To create new features, the implemented framework supports categorical encoding and cyclic encoding of time-based values.
- Cyclic Features: Cyclic features can be used to model time values through periodic functions [19]. In the implementation, sinusoidal signals \(x_{sin}, x_{cos}\) with a selected frequency f can be created based on a sample series n:

  $$\begin{aligned} x_{sin}[n] = \sin (2 \pi f n) \end{aligned}$$ (1)

  $$\begin{aligned} x_{cos}[n] = \cos (2 \pi f n) \end{aligned}$$ (2)

  The implementation offers the creation of features with a zero-order hold over a certain time period, for instance \(T_S = 1~day\) for a signal with a time period of \(T = 1~week\).
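As a minimal sketch, the encoding of Eqs. (1) and (2) with a zero-order hold could look as follows (the helper name and signature are illustrative assumptions, not the framework's actual API):

```python
import numpy as np

def cyclic_features(n_samples, period, hold=1):
    """Sin/cos encoding of a periodic time index (Eqs. 1-2).
    `hold` applies a zero-order hold, e.g. hold=24 keeps the value
    constant for one day on hourly data."""
    n = (np.arange(n_samples) // hold) * hold   # zero-order hold
    f = 1.0 / period                            # one cycle per `period` samples
    return np.sin(2 * np.pi * f * n), np.cos(2 * np.pi * f * n)

# Weekly cycle (T = 168 hourly samples), held constant per day:
x_sin, x_cos = cyclic_features(n_samples=336, period=168, hold=24)
```

The sin/cos pair ensures that the distance between encoded time values is continuous across period boundaries, unlike a raw hour-of-week index.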
- Categorical Features: Categorical encoding represents a feature with discrete numerical values through a number of boolean features [11]. In this implementation, for a feature x with discrete possible values \(v_{0,\dots,N}\), the categorical features \(x_{0,\dots,N}\) are defined through:

  $$\begin{aligned} x_{i} = {\left\{ \begin{array}{ll} 1 &{} x = v_{i}\\ 0 &{} else \end{array}\right. } \end{aligned}$$ (3)

  The framework offers categorical encoding for time-based values as well as a division factor to create an encoding of a downsampled version of the time values.
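The encoding of Eq. (3), together with a division factor for downsampling time values, can be sketched as follows (function and parameter names are illustrative, not the framework's interface):

```python
import numpy as np

def categorical_encode(x, values):
    # One boolean feature per discrete value v_i, set to 1 where x == v_i (Eq. 3).
    return np.stack([(x == v).astype(int) for v in values], axis=1)

# Day-of-week encoding of an hourly time index; the division factor 24
# downsamples hours to days before encoding.
hours = np.arange(72)            # three days of hourly samples
day = (hours // 24) % 7          # downsampled time value
encoded = categorical_encode(day, values=range(7))
```

Each row of the result contains exactly one 1, since the values \(v_i\) are mutually exclusive.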
- Time-based Features: The framework implements a method of dynamic timeseries unrolling to create features \(x_{n-1}, x_{n-2}, \dots, x_{n-N}\) from an existing feature x, based on the research in [7]. In this implementation, the unrolling is realized through filter operations from the scipy.signal library: the dynamic features are created through the convolution of the signal x with a Kronecker delta for \(i = 1 \dots N\):

  $$\begin{aligned} x_{dyn,i}[n] = x[n] * \delta [n-i] \end{aligned}$$ (4)

  This operation creates the delayed signals \(x_{dyn,1},\dots,x_{dyn,N}\). Zero padding is used for the samples in the delayed signals for which no values are available.
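A minimal sketch of this convolution-based unrolling with scipy.signal.lfilter, assuming an illustrative helper name:

```python
import numpy as np
from scipy.signal import lfilter

def unroll(x, N):
    # Delay x by i samples via an FIR filter whose impulse response is
    # the Kronecker delta delta[n - i] (Eq. 4); lfilter's zero initial
    # conditions provide the zero padding at the start of each signal.
    delayed = []
    for i in range(1, N + 1):
        delta = np.zeros(i + 1)
        delta[i] = 1.0
        delayed.append(lfilter(delta, [1.0], x))
    return np.stack(delayed, axis=1)

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
X_dyn = unroll(x, 2)   # columns: x[n-1], x[n-2]
```

The delayed columns let a static regression model pick up short-term dynamics of the signal.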
Feature Selection
The framework offers several threshold-based feature selection methods, which analyze the input and target features based on a certain criterion and discard features with a low criterion value. A widely used criterion is the Pearson Correlation Coefficient, which is used to detect linear correlations between features [4]. For two features with samples \(x_{0,\dots,N}, y_{0,\dots,N}\) and mean values \(\bar{x}\) and \(\bar{y}\), it is calculated as:

$$\begin{aligned} r_{xy} = \frac{\sum _{i=0}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum _{i=0}^{N} (x_i - \bar{x})^2} \sqrt{\sum _{i=0}^{N} (y_i - \bar{y})^2}} \end{aligned}$$
In addition to the Pearson Correlation Coefficient, the framework provides thresholds based on non-linear dependency detection coefficients such as Maximum Information Coefficient (\(\textrm{MIC}\)) [17].
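A threshold-based selection using the Pearson Correlation Coefficient can be sketched as follows (names and data are illustrative, not the framework's API):

```python
import numpy as np

def select_by_pearson(X, y, threshold):
    # Keep columns whose absolute Pearson correlation with y exceeds
    # the threshold.
    r = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    keep = np.abs(r) > threshold
    return X[:, keep], keep

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 200)
X = np.column_stack([
    t,                                   # strongly correlated with y
    rng.normal(size=200),                # pure noise
    -t + 0.01 * rng.normal(size=200),    # strong negative correlation
])
y = 2 * t
X_sel, keep = select_by_pearson(X, y, threshold=0.8)
```

Taking the absolute value keeps features with strong negative correlation, which are equally informative for a linear model.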
Transformation and Filtering Operations
To transform features, the framework implements the Box-Cox transformation as well as square root and inverse transformations. In addition, the framework provides filtering operations, which were applied in timeseries prediction for instance in [9]. Discrete-time filters can be implemented in Python through the functions in scipy.signal, which offers routines for calculating the coefficients of different types of digital filters. A digital filter of order N can be defined through its transfer function H(z) in direct form:

$$\begin{aligned} H(z) = \frac{\sum _{i=0}^{N} b_i z^{-i}}{1 + \sum _{i=1}^{N} a_i z^{-i}} \end{aligned}$$
The filter coefficients \(a_i\) and \(b_i\) define the behavior of the filter. The framework implements topologies such as the Butterworth [9] or Chebyshev filter. In addition, an envelope detection filter was implemented for the demodulation of modulated signals. The direct-form filter classes of the framework offer a simple option for extension: different architectures can be implemented by re-defining the method for coefficient calculation. This allows creating filters with different Finite Impulse Response (FIR) or Infinite Impulse Response (IIR) structures.
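As a sketch of such a filtering operation, a low-pass Butterworth smoother built on the scipy.signal coefficient routines could look like this (signal and cutoff values are chosen purely for illustration):

```python
import numpy as np
from scipy.signal import butter, lfilter

def butter_smooth(x, cutoff, fs, order=2):
    # scipy.signal.butter computes the direct-form coefficients b, a;
    # lfilter applies the resulting IIR filter to the signal.
    b, a = butter(order, cutoff, btype="low", fs=fs)
    return lfilter(b, a, x)

fs = 24.0                                   # hourly samples: 24 per day
t = np.arange(0, 14, 1 / fs)                # two weeks, in days
daily = np.sin(2 * np.pi * t)               # daily cycle to preserve
fast = 0.5 * np.sin(2 * np.pi * 8 * t)      # high-frequency disturbance
smoothed = butter_smooth(daily + fast, cutoff=2.0, fs=fs)  # 2 cycles/day
```

Swapping the coefficient routine (e.g. scipy.signal.cheby1 for a Chebyshev design) changes the topology while the application step stays identical, which mirrors the extension mechanism described above.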
3 Case Study
The framework was demonstrated on a use case from prediction for energy systems modeling. For this purpose, a mixed office-campus building was selected. A white-box prediction model was to be trained on existing measurement data provided by [18]. In the prediction of energy demand, different factors must be considered, including thermal characteristics and the behavior of the Heating, Ventilation and Air Conditioning (HVAC) system [13]. Additionally, building energy demand may depend on occupancy [8] or be subject to seasonal trends [19]. Many of these factors show non-linear or dynamic behavior, which makes it difficult to address them through a purely linear model. Through feature engineering methods, these factors can be incorporated into the data-driven model.
Data-driven Model
For the selected application, a data-driven model of the building energy demand was to be created. To demonstrate the effect of feature engineering, two models were trained on the existing measurement data: a basic regression model and a regression model with engineered features. The energy demand was measured from 05/2019 to 03/2020 with a sampling time of 1 h [18]. The feature set consisted of temperature data as well as data on lecture registrations inside the building. In addition, time-based data such as day-night changes and public or university holidays were included. For the energy demand prediction, a linear regression architecture was selected due to its simplicity and explainability as a white-box ML model. Dynamic system behavior as well as seasonality of the underlying system were to be incorporated through feature engineering. Finally, the implemented models were compared to a baseline neural network model.
The implemented workflow consisted of a combination of cyclic [19] and categorical features [16], which were used to model seasonal trends, as well as of data smoothing [9] and dynamic timeseries unrolling [12]. Finally, feature selection using the Pearson Correlation Coefficient was applied, similar to the method applied by Chen et al. [4]. An overview of the implemented workflow is depicted in Fig. 1.
The model training was performed with a train-test split of 0.8 and 5-fold cross-validation. For the model with engineered features, the parameters of the timeseries unrolling and feature selection steps were determined through a grid search based on the metrics Coefficient of Determination (\(R^2\)), Mean Squared Error (\(\textrm{MSE}\)) and Mean Absolute Percentage Error (\(\textrm{MAPE}\)).
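Such a procedure, a 0.8 train-test split plus cross-validated grid search over a feature selection parameter, can be sketched with scikit-learn on synthetic data. Here f_regression serves as a correlation-based selection score, and all data and parameter values are illustrative, not those of the case study:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the measurement data: two informative inputs,
# four irrelevant ones.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 6))
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.1 * rng.normal(size=500)

# Feature selection followed by linear regression; the number of kept
# features is tuned by grid search under 5-fold cross-validation.
pipe = Pipeline([
    ("select", SelectKBest(score_func=f_regression)),
    ("model", LinearRegression()),
])
grid = GridSearchCV(pipe, {"select__k": [1, 2, 4, 6]}, cv=5, scoring="r2")

X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.8, random_state=0)
grid.fit(X_tr, y_tr)
r2 = grid.score(X_te, y_te)   # R^2 on the held-out test split
```

Wrapping the feature engineering step inside the pipeline ensures the selection is re-fitted on each cross-validation fold, avoiding leakage from the validation data.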
Experimental Results
The two models were trained on the measurement data and compared in terms of performance metrics \(R^2\), Coefficient of Variation of the Root Mean Square Error (\(\mathrm {CV{-}RMSE}\)) and \(\textrm{MAPE}\). Table 1 gives an overview of the performance metrics for the different models.
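The reported metrics follow their standard definitions; as a brief sketch (the sample values are illustrative):

```python
import numpy as np

def cv_rmse(y_true, y_pred):
    # Coefficient of Variation of the RMSE: RMSE normalized by the mean
    return np.sqrt(np.mean((y_true - y_pred) ** 2)) / np.mean(y_true)

def mape(y_true, y_pred):
    # Mean Absolute Percentage Error
    return np.mean(np.abs((y_true - y_pred) / y_true))

y_true = np.array([10.0, 12.0, 8.0, 10.0])
y_pred = np.array([9.0, 12.0, 9.0, 10.0])
```

Both metrics are scale-free, which makes them suitable for comparing models across buildings with different absolute demand levels.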
The performance metrics showed a significant improvement in prediction accuracy for the linear regression model through the engineered features. This observation could also be made from the timeseries analysis depicted in Fig. 2.
The timeseries analysis showed improvements in the linear regression model especially in the seasonal trends, such as day-night or weekly behavior of the energy demand. This improvement was attributed to the introduced features. While the categorical features provided information about daily or weekly trends, the day-night behavior was modeled through the cyclic features. The introduction of the delayed input features through dynamic timeseries unrolling provided information about short-time changes in the energy demand.
The introduction of the additional features modeling the seasonality and dynamics of the energy demand showed a significant accuracy improvement for the linear regression model. The results suggest that this approach shows promise for improving the accuracy of explainable linear models and could furthermore be applied to non-linear methods such as neural networks or decision trees.
4 Conclusion
This work presents a Python framework for feature engineering that provides different methods through a standardized interface. The framework is based on the scikit-learn Python package and offers classic feature engineering methods such as feature expansion, feature creation and feature selection, as well as transformation and filter operations. Through the defined interfaces of the framework, additional methods can be added with low effort. Finally, the framework is demonstrated on a case study of energy demand prediction, using a workflow created from a subset of the implemented methods for data-driven model creation.
Future Work
The current version of the framework offers many options for extension. For instance, additional feature engineering methods can be added using the provided interfaces. In addition, combinations of the implemented feature engineering methods, such as the presented workflow, can be applied to prediction in different use cases.
References
Akay, M.F.: Support vector machines combined with feature selection for breast cancer diagnosis. Expert Syst. Appl. 36(2), 3240–3247 (2009)
Arrieta, A.B., et al.: Explainable artificial intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI. Inf. Fusion 58, 82–115 (2020)
Cai, J. et al.: Feature selection in machine learning: A new perspective. Neurocomputing 300, 70–79 (2018)
Chen, H., Chang, X.: Photovoltaic power prediction of LSTM model based on Pearson feature selection. Energy Reports 7, 1047–1054 (2021)
Cheng, X., et al.: Polynomial regression as an alternative to neural nets. arXiv:1806.06850 [cs, stat] (2019)
Eilers, P.H.C., Marx, B.D.: Flexible smoothing with B-splines and penalties. Stat. Sci. 11(2) (1996)
Falay, B. et al.: Coupling physical and machine learning models: Case study of a single-family house. In: Modelica Conferences, pp. 335–341 (2021)
Ghofrani, A., Nazemi, S.D., Jafari, M.A.: Prediction of building indoor temperature response in variable air volume systems. J. Build. Perform. Simul. 13(1), 34–47 (2020)
Gómez, V.: The use of butterworth filters for trend and cycle estimation in economic time series. J. Bus. Econ. Stat. 19(3), 365–373 (2001)
Gupta, V., Mittal, M.: Respiratory signal analysis using PCA, FFT and ARTFA. In: 2016 International Conference on Electrical Power and Energy Systems (ICEPES), pp. 221–225 (2016)
Hancock, J.T., Khoshgoftaar, T.M.: Survey on categorical data for neural networks. J. Big Data 7(1), 28 (2020)
Kumar, U., Jain, V.K.: Time series models (Grey-Markov, Grey Model with rolling mechanism and singular spectrum analysis) to forecast energy consumption in India. Energy 35(4), 1709–1716 (2010)
Maccarini, A. et al.: Development of a Modelica-based simplified building model for district energy simulations. J. Phys. Conf. Ser. 2042(1), 012078 (2021)
Manfren, M., James, P.A.B., Tronchin, L.: Data-driven building energy modelling – an analysis of the potential for generalisation through interpretable machine learning. Renew. Sustain. Energy Rev. 167, 112686 (2022)
Molnar, C., Casalicchio, G., Bischl, B.: Interpretable Machine Learning – A Brief History, State-of-the-Art and Challenges (2020)
Potdar, K., Taher, S., Chinmay, D.: A comparative study of categorical variable encoding techniques for neural network classifiers. Int. J. Comput. Appl. 175(4), 7–9 (2017)
Reshef, Y.A. et al.: Measuring dependence powerfully and equitably. J. Mach. Learn. Res. 63 (2016)
Schranz, T. et al.: Energy prediction under changed demand conditions: Robust machine learning models and input feature combinations. In: Proceedings of the 17th International Conference of the International Building Performance Simulation Association (Building Simulation 2021) (2021)
Zhang, G. et al.: Accurate forecasting of building energy consumption via a novel ensembled deep learning method considering the cyclic feature. Energy 201, 117531 (2020)
Zheng, A., Casari, A.: Feature Engineering for Machine Learning: Principles and Techniques for Data Scientists (1. edn.). O’Reilly Media, Inc. (2018)
Rights and permissions
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
© 2024 The Author(s)
Wilfling, S. (2024). Augmenting Explainable Data-Driven Models in Energy Systems: A Python Framework for Feature Engineering. In: Niggemann, O., Beyerer, J., Krantz, M., Kühnert, C. (eds) Machine Learning for Cyber-Physical Systems. ML4CPS 2023. Technologien für die intelligente Automation, vol 18. Springer, Cham. https://doi.org/10.1007/978-3-031-47062-2_12