
1 Introduction

Modeling and simulation is a crucial step in the design and optimization of energy systems. While traditional modeling methods rely on system parameters, a more recent approach creates data-driven models from measurement data taken from a system. Data-driven models are mainly based on machine learning (ML) methods, which can be classified into white-box and black-box methods based on their explainability [15]. For instance, Molnar et al. [15] classify methods such as linear or logistic regression and decision trees as white-box ML methods, and methods such as neural networks or decision tree ensembles as black-box ML methods.

A white-box ML model provides information about the underlying system, for instance through its input-output relations (interpretability) or through a humanly comprehensible structure (explainability) [14]. To keep their structure comprehensible, explainable models are restricted to a reduced complexity. As a result, their capability of modeling complex dependencies is often limited, creating a trade-off between accuracy and explainability [2].

To capture more complex dependencies with white-box ML models, feature engineering methods are applied. The main purpose of feature engineering is to augment the existing dataset by adding new information, or by expanding or reducing the feature set. In addition, the quality of a single feature can be improved, for instance through transformation or filtering [9].

The area of feature engineering covers a wide range of methods, such as feature creation, feature expansion [5] or feature selection [3]. Feature creation includes encodings of time-based features, such as cyclic features [19], or categorical encoding [11]. Similarly, feature expansion creates new features based on existing ones and covers classical methods such as polynomial expansion [5] or spline interpolation [6]. In contrast to feature creation and expansion, feature selection aims to reduce the size of the feature set. While large feature sets may contain more information, high-dimensional feature sets may be subject to sparsity or multicollinearity. To address this, methods such as Principal Component Analysis (PCA) [10] reduce the feature set through transformation, while feature selection methods discard features. Feature selection can be implemented, for instance, through sequential methods such as forward or backward selection, or through correlation criteria [3], including measures based on the Pearson Correlation Coefficient as well as entropy-based criteria [1]. For correlation criteria, feature selection is mainly implemented through a threshold-based selection.
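As a brief illustration of the expansion and transformation methods named above, scikit-learn ships ready-made implementations of polynomial expansion and PCA; the following minimal sketch (with illustrative data) is not part of the presented framework:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import PolynomialFeatures

# Illustrative feature matrix with two features and three samples
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 7.0]])

# Polynomial expansion of degree 2: [x1, x2] -> [1, x1, x2, x1^2, x1*x2, x2^2]
X_poly = PolynomialFeatures(degree=2).fit_transform(X)

# PCA as a transformation-based reduction: project onto the first component
X_pca = PCA(n_components=1).fit_transform(X)
```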

Feature engineering methods are mainly applied during the first steps of creating a data-driven model, yielding an engineered dataset that is used for training [4]. However, feature engineering methods can also be used in combination with model selection procedures, such as grid search [1]. Feature engineering methods are widely used in applications from the energy domain, such as the prediction of building energy demand [20] or photovoltaic power prediction [4].

1.1 Main Contribution

To apply different feature engineering methods to the creation of data-driven models, a Python framework implementing these methods was developed. The framework is based on scikit-learn and can be imported as a Python package. Compared with existing frameworks, it focuses on providing a standardized function interface that allows different combined workflows to be created with low effort. The functionality of the framework is demonstrated in a case study on energy demand prediction, for which a multi-step workflow combining several feature engineering methods is created. The results of the case study show an improvement in prediction accuracy through the applied feature engineering workflow.

2 Method

The presented framework implements various feature engineering methods in Python based on the scikit-learn framework. It provides methods for feature creation and expansion, feature selection, as well as transformation and filtering operations. The source code of the framework is openly available at https://github.com/tug-cps/featureengineering.

Feature Creation and Expansion

In the framework, different methods for feature creation and expansion are implemented. These methods create new features from time values or through the expansion of existing features. To create new features, the framework supports categorical encoding and cyclic encoding of time-based values.

  • Cyclic Features: Cyclic features can be used to model time values through periodic functions [19]. In the implementation, sinusoidal signals \(x_{sin}, x_{cos}\) with a selected frequency f can be created based on a sample series n:

    $$\begin{aligned} x_{sin}[n] = \sin (2 \pi f n) \end{aligned}$$
    (1)
    $$\begin{aligned} x_{cos}[n] = \cos (2 \pi f n) \end{aligned}$$
    (2)

    The implementation offers the creation of features with a zero-order hold function for a certain time period, for instance \(T_S = 1~day\) for a signal with a time period of \(T=1~week\).

  • Categorical Features: Categorical encoding creates a representation of discrete numerical values through a number of features with boolean values [11]. In this implementation, a feature x with discrete possible values \(v_{0,\dots ,N}\) is represented through categorical features \(x_{0,\dots ,N}\), where a single feature \(x_{i}\) is defined as:

    $$\begin{aligned} x_{i} = {\left\{ \begin{array}{ll} 1 & x = v_{i}\\ 0 & \text{else} \end{array}\right. } \end{aligned}$$
    (3)

    The framework offers categorical encoding for time-based values as well as a division factor to create an encoding of a downsampled version of the time values.

  • Time-based Features: The framework implements dynamic timeseries unrolling to create features \(x_{n-1}\), \(x_{n-2}\), ... \(x_{n-N}\) from an existing feature x. The method is based on the research in [7] and is realized through filter operations from the scipy.signal library. The dynamic features are created through the convolution of the signal x with a shifted Kronecker delta for \(i = 1,\dots ,N\):

    $$\begin{aligned} x_{dyn,i}[n] = x[n] * \delta [n-i] \end{aligned}$$
    (4)

    This operation creates the delayed signals \(x_{dyn,1},...,x_{dyn,N}\). In this implementation, zero padding is used for the samples in the delayed signals for which no values are available.
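The three creation methods above can be sketched in plain NumPy. The framework's actual class interface is not reproduced here, so the function names (`cyclic_features`, `categorical_features`, `unroll_timeseries`) are illustrative assumptions:

```python
import numpy as np

def cyclic_features(n_samples, period):
    """Sine/cosine encoding of a periodic time index (Eqs. 1-2), with f = 1/period."""
    n = np.arange(n_samples)
    f = 1.0 / period
    return np.sin(2 * np.pi * f * n), np.cos(2 * np.pi * f * n)

def categorical_features(x, values):
    """One-hot encoding of x against the discrete values v_0..v_N (Eq. 3)."""
    x = np.asarray(x)
    return np.stack([(x == v).astype(int) for v in values], axis=1)

def unroll_timeseries(x, n_lags):
    """Delayed copies x[n-1]..x[n-N] of a signal, zero-padded at the start (Eq. 4)."""
    x = np.asarray(x, dtype=float)
    return np.stack(
        [np.concatenate([np.zeros(i), x[:-i]]) for i in range(1, n_lags + 1)],
        axis=1,
    )

# One week of hourly samples: daily sine/cosine pair, hour-of-day one-hot, lags
x_sin, x_cos = cyclic_features(24 * 7, period=24)
hour_of_day = np.arange(24 * 7) % 24
x_onehot = categorical_features(hour_of_day, values=range(24))
x_lagged = unroll_timeseries(np.arange(10.0), n_lags=3)
```

Each row of `x_lagged` then holds the previous `n_lags` samples of the signal, with leading zeros where no past values exist.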

Feature Selection

The framework offers several threshold-based feature selection methods, which analyze the input and target features based on a certain criterion and then discard features with a low value of that criterion. A widely used criterion is the Pearson Correlation Coefficient, which detects linear correlations between features [4]. It is calculated for two features with samples \(x_{0,\dots ,N}, y_{0,\dots ,N}\) and mean values \(\bar{x}\) and \(\bar{y}\):

$$\begin{aligned} r_{x,y} = \frac{\sum _{i=0}^{N}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum _{i=0}^N (x_i - \bar{x})^2\sum _{i=0}^N(y_i - \bar{y})^2}} \end{aligned}$$
(5)

In addition to the Pearson Correlation Coefficient, the framework provides thresholds based on non-linear dependency detection coefficients such as the Maximal Information Coefficient (\(\textrm{MIC}\)) [17].
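A minimal sketch of such a threshold-based selection, assuming a NumPy feature matrix with one column per feature (the framework's own interface and the function name `select_by_pearson` are illustrative):

```python
import numpy as np

def select_by_pearson(X, y, threshold=0.3):
    """Keep columns of X whose absolute Pearson r with target y (Eq. 5) meets the threshold."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    r = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    keep = np.abs(r) >= threshold
    return X[:, keep], keep

# Illustrative data: one informative feature, one pure-noise feature
rng = np.random.default_rng(0)
n = 500
informative = rng.normal(size=n)
noise = rng.normal(size=n)
y = 2.0 * informative + 0.1 * rng.normal(size=n)
X = np.column_stack([informative, noise])

X_sel, mask = select_by_pearson(X, y, threshold=0.5)
```

With this data the informative column passes the threshold while the noise column is discarded.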

Transformation and Filtering Operations

To transform features, the framework implements the Box-Cox transformation as well as square root and inverse transformations. In addition, the framework provides filtering operations, which have been applied in timeseries prediction, for instance in [9]. Discrete-time filters can be implemented in Python through the functions of scipy.signal, which offers routines for calculating the coefficients of different types of digital filters. A digital filter of order N can be defined through its transfer function H(z) in direct form:

$$\begin{aligned} H(z) = \frac{\sum _{i=0}^{N} b_i z^{-i}}{\sum _{i=0}^{N} a_i z^{-i}} \end{aligned}$$
(6)

The filter coefficients \(a_i\) and \(b_i\) define the behavior of the filter. The framework implements topologies such as the Butterworth [9] or Chebyshev filter. In addition, an envelope detection filter was implemented for the demodulation of modulated signals. The direct form filter classes of the framework offer a simple option for extension: different architectures can be implemented by re-defining the method for coefficient calculation, which allows creating filters with different Finite Impulse Response (FIR) or Infinite Impulse Response (IIR) structures.
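The framework's own filter classes are not reproduced here; as a sketch, the underlying scipy.signal routines can be used directly to design a Butterworth low-pass filter and smooth a noisy daily cycle (all signal parameters below are illustrative):

```python
import numpy as np
from scipy import signal

# Design a low-pass Butterworth filter: order N = 4, cutoff at 0.2 of the
# Nyquist frequency. For hourly data this passes the daily cycle
# (normalized frequency 2/24 ~ 0.083) while attenuating faster noise.
b, a = signal.butter(N=4, Wn=0.2)  # coefficients b_i, a_i of the transfer function

# Apply the filter to a noisy daily cycle; filtfilt filters forward and
# backward, which avoids introducing a phase delay.
n = np.arange(24 * 14)  # two weeks of hourly samples
rng = np.random.default_rng(1)
x = np.sin(2 * np.pi * n / 24) + 0.5 * rng.normal(size=n.size)
x_smooth = signal.filtfilt(b, a, x)
```

Swapping `signal.butter` for `signal.cheby1` (with an additional ripple argument) yields a Chebyshev design with the same coefficient interface, which mirrors the extension mechanism described above.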

3 Case Study

The framework was demonstrated on a use case from energy systems modeling: energy demand prediction for a mixed office-campus building. A white-box prediction model was to be trained on existing measurement data provided by [18]. In the prediction of energy demand, different factors must be considered, including thermal characteristics and the behavior of the Heating, Ventilation and Air Conditioning (HVAC) system [13]. Additionally, building energy demand may depend on occupancy [8] or be subject to seasonal trends [19]. Many of these factors show non-linear or dynamic behavior, which makes them difficult to address through a purely linear model. Through feature engineering methods, these factors were to be incorporated into the data-driven model.

Data-driven Model

For the selected application, a data-driven model of the building energy demand was to be created. To demonstrate the effect of feature engineering, two models were trained on the existing measurement data: a basic regression model and a regression model with engineered features. The energy demand was measured from 05/2019 to 03/2020 with a sampling time of 1 h [18]. The feature set consisted of temperature data as well as registration data for lectures inside the building. In addition, time-based data such as day-night changes and public or university holidays were included. A linear regression architecture was selected for the energy demand prediction due to its simplicity and explainability as a white-box ML model; dynamic system behavior and the seasonality of the underlying system were to be incorporated through feature engineering. Finally, the implemented models were compared to a baseline neural network model.

The implemented workflow consisted of a combination of cyclic [19] and categorical features [16], which were used to model seasonal trends, as well as data smoothing [9] and dynamic timeseries unrolling [12]. Finally, feature selection using the Pearson Correlation Coefficient was applied, similar to the method applied by Chen et al. [4]. An overview of the implemented workflow is depicted in Fig. 1.

Fig. 1 Implemented workflow: the basic feature set is processed through Butterworth filtering of selected features and dynamic timeseries unrolling, extended with cyclic and categorical features, and reduced through feature selection based on the Pearson Correlation Coefficient, yielding the engineered feature set

The model training was performed with a train-test split of 0.8 and 5-fold cross-validation. For the model with engineered features, the parameters of the timeseries unrolling and feature selection steps were determined through a grid search based on the metrics Coefficient of Determination (\(R^2\)), Mean Squared Error (\(\textrm{MSE}\)) and Mean Absolute Percentage Error (\(\textrm{MAPE}\)).
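This training procedure can be sketched with scikit-learn's GridSearchCV. The framework's actual step names and parameter grids are not given here, so the pipeline below uses an identity placeholder where the feature engineering steps would go, and synthetic data in place of the measurements:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

# Placeholder pipeline: "features" stands in for the engineering steps
# (unrolling, selection), whose parameters would be searched over.
pipe = Pipeline([
    ("features", FunctionTransformer()),  # identity placeholder
    ("model", LinearRegression()),
])

# Synthetic stand-in for the measurement data
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=200)

# Train-test split of 0.8 and 5-fold cross-validated grid search
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.8, random_state=0)
search = GridSearchCV(
    pipe,
    param_grid={"model__fit_intercept": [True, False]},  # stand-in grid
    cv=5,
    scoring="r2",  # MSE/MAPE are available via other scoring strings
)
search.fit(X_tr, y_tr)
test_r2 = search.score(X_te, y_te)
```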

Experimental Results

The two models were trained on the measurement data and compared in terms of performance metrics \(R^2\), Coefficient of Variation of the Root Mean Square Error (\(\mathrm {CV{-}RMSE}\)) and \(\textrm{MAPE}\). Table 1 gives an overview of the performance metrics for the different models.
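The reported values themselves are given in Table 1; for reference, the three metrics can be computed as follows (a plain NumPy sketch with made-up example values, not the case study data):

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of Determination."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

def cv_rmse(y_true, y_pred):
    """RMSE normalized by the mean of the measurements."""
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    return rmse / np.mean(y_true)

def mape(y_true, y_pred):
    """Mean Absolute Percentage Error (requires non-zero measurements)."""
    return np.mean(np.abs((y_true - y_pred) / y_true))

# Illustrative values only
y_true = np.array([100.0, 120.0, 80.0, 110.0])
y_pred = np.array([98.0, 125.0, 78.0, 108.0])
```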

Table 1 Performance metrics

The performance metrics showed a significant improvement in prediction accuracy for the linear regression model through the engineered features. This observation could also be made from the timeseries analysis depicted in Fig. 2.

Fig. 2 Timeseries analysis for the linear regression models over a period of 25 days from the test set: consumption in kWh over time, comparing the measured values with the predictions of the linear regression model and the model with engineered features; the measured peaks are the highest, and peaks are smaller on weekends

The timeseries analysis showed improvements in the linear regression model especially for seasonal trends, such as the day-night or weekly behavior of the energy demand. This improvement was attributed to the introduced features: while the categorical features provided information about daily or weekly trends, the day-night behavior was modeled through the cyclic features. The delayed input features introduced through dynamic timeseries unrolling provided information about short-term changes in the energy demand.

The introduction of the additional features modeling the seasonality and dynamics of the energy demand showed a significant accuracy improvement for the linear regression model. The results suggest that this approach shows promise for improving the accuracy of explainable linear models and could furthermore be applied to non-linear methods such as neural networks or decision trees.

4 Conclusion

This work presents a Python framework for feature engineering that provides different methods through a standardized interface. The framework is based on scikit-learn, is distributed as a Python package, and offers classic feature engineering methods such as feature creation, feature expansion, feature selection, and transformation and filter operations. Through its defined interfaces, additional methods can be added with low effort. Finally, the framework is demonstrated in a case study on energy demand prediction, using a workflow created from a subset of the implemented methods for data-driven model creation.

Future Work

The current version of the framework offers many options for extension. For instance, additional feature engineering methods can be added using the provided interfaces. In addition, combinations of the implemented methods, such as the presented workflow, can be applied to prediction tasks in different use cases.