
1 Introduction

Modeling and simulation is a crucial step in the design and optimization of energy systems. While traditional modeling methods rely on system parameters, a more recent approach creates data-driven models from measurement data taken from a system. Data-driven models are mainly based on machine learning (ML) methods, which can be classified into white-box and black-box methods based on their explainability [15]. For instance, Molnar et al. [15] classify methods such as linear or logistic regression and decision trees as white-box ML methods, and methods such as neural networks or decision tree ensembles as black-box ML methods.

A white-box ML model provides information about the underlying system, for instance through its input-output relations (interpretability) or through a humanly comprehensible structure (explainability) [14]. To keep their structure comprehensible, explainable models are restricted to a reduced complexity. As a result, their capability of modeling complex dependencies is often limited, creating a trade-off between accuracy and explainability [2].

To capture more complex dependencies with white-box ML models, feature engineering methods are applied. The main purpose of feature engineering is to augment the existing dataset by adding new information, or by expanding or reducing the feature set. In addition, the quality of a single feature can be improved, for instance through transformation or filtering [9].

The area of feature engineering covers a wide range of methods, such as feature creation, feature expansion [5] or feature selection [3]. Feature creation includes encodings of time-based features, such as cyclic features [19], or categorical encoding [11]. Similarly, feature expansion creates new features based on existing ones and covers classical methods such as polynomial expansion [5] or spline interpolation [6]. In contrast to feature creation and expansion, feature selection aims to reduce the size of the feature set. While large feature sets may contain more information, high-dimensional feature sets may be subject to sparsity or multicollinearity. To address this, methods such as Principal Component Analysis (PCA) [10] reduce the feature set through transformation, while feature selection methods discard features. Feature selection can be implemented, for instance, through sequential methods such as forward or backward selection, or through correlation criteria [3], including measures based on the Pearson Correlation Coefficient as well as entropy-based criteria [1]. For correlation criteria, feature selection is mainly implemented through a threshold-based selection.
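As a brief illustration of the expansion and transformation methods named above, scikit-learn ships ready-made implementations of polynomial expansion and PCA; the following minimal sketch (with illustrative data) is not part of the presented framework:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import PolynomialFeatures

# Illustrative feature matrix with two features and three samples
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 7.0]])

# Polynomial expansion of degree 2: [x1, x2] -> [1, x1, x2, x1^2, x1*x2, x2^2]
X_poly = PolynomialFeatures(degree=2).fit_transform(X)

# PCA as a transformation-based reduction: project onto the first component
X_pca = PCA(n_components=1).fit_transform(X)
```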

Feature engineering methods are mainly applied during the first steps of creating a data-driven model, yielding an engineered dataset that is used for training [4]. However, feature engineering methods can also be used in combination with model selection procedures, such as grid search [1]. Feature engineering methods are widely used in applications from the energy domain, such as the prediction of building energy demand [20] or photovoltaic power prediction [4].

1.1 Main Contribution

To apply different feature engineering methods to the creation of data-driven models, a Python framework implementing these methods was developed. The framework is based on scikit-learn and can be imported as a Python package. Compared with existing frameworks, it focuses on providing a standardized function interface that allows different combined workflows to be created with low effort. The functionality of the framework is demonstrated in a case study on energy demand prediction, for which a multi-step workflow combining several feature engineering methods is created. The results of the case study show an improvement in prediction accuracy through the applied feature engineering workflow.

2 Method

The presented framework implements various feature engineering methods in Python based on the scikit-learn framework. It provides methods for feature creation and expansion, feature selection, as well as transformation and filtering operations. The source code of the framework is openly available at https://github.com/tug-cps/featureengineering.

Feature Creation and Expansion

In the framework, different methods for feature creation and expansion are implemented. These methods create new features from time values or through the expansion of existing features. To create new features, the framework supports categorical encoding and cyclic encoding of time-based values.

  • Cyclic Features: Cyclic features can be used to model time values through periodic functions [19]. In the implementation, sinusoidal signals \(x_{sin}, x_{cos}\) with a selected frequency f can be created based on a sample series n:

    $$\begin{aligned} x_{sin}[n] = \sin (2 \pi f n) \end{aligned}$$
    (1)
    $$\begin{aligned} x_{cos}[n] = \cos (2 \pi f n) \end{aligned}$$
    (2)

    The implementation offers the creation of features with a zero-order hold function for a certain time period, for instance \(T_S = 1~day\) for a signal with a time period of \(T=1~week\).

  • Categorical Features: Categorical encoding creates a representation of discrete numerical values through a number of features with boolean values [11]. In this implementation, a feature x with discrete possible values \(v_{0,\dots ,N}\) is represented through categorical features \(x_{0,\dots ,N}\), where a single feature \(x_{i}\) is defined as:

    $$\begin{aligned} x_{i} = {\left\{ \begin{array}{ll} 1 & x = v_{i}\\ 0 & \text{else} \end{array}\right. } \end{aligned}$$
    (3)

    The framework offers categorical encoding for time-based values as well as a division factor to create an encoding of a downsampled version of the time values.

  • Time-based Features: The framework implements dynamic timeseries unrolling to create features \(x_{n-1}\), \(x_{n-2}\), ... \(x_{n-N}\) from an existing feature x. The method is based on the research in [7] and is realized through filter operations from the scipy.signal library. The dynamic features are created through the convolution of the signal x with a shifted Kronecker delta for \(i = 1,\dots ,N\):

    $$\begin{aligned} x_{dyn,i}[n] = x[n] * \delta [n-i] \end{aligned}$$
    (4)

    This operation creates the delayed signals \(x_{dyn,1},...,x_{dyn,N}\). In this implementation, zero padding is used for the samples in the delayed signals for which no values are available.
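The three creation methods above can be sketched in plain NumPy. The framework's actual class interface is not reproduced here, so the function names (`cyclic_features`, `categorical_features`, `unroll_timeseries`) are illustrative assumptions:

```python
import numpy as np

def cyclic_features(n_samples, period):
    """Sine/cosine encoding of a periodic time index (Eqs. 1-2), with f = 1/period."""
    n = np.arange(n_samples)
    f = 1.0 / period
    return np.sin(2 * np.pi * f * n), np.cos(2 * np.pi * f * n)

def categorical_features(x, values):
    """One-hot encoding of x against the discrete values v_0..v_N (Eq. 3)."""
    x = np.asarray(x)
    return np.stack([(x == v).astype(int) for v in values], axis=1)

def unroll_timeseries(x, n_lags):
    """Delayed copies x[n-1]..x[n-N] of a signal, zero-padded at the start (Eq. 4)."""
    x = np.asarray(x, dtype=float)
    return np.stack(
        [np.concatenate([np.zeros(i), x[:-i]]) for i in range(1, n_lags + 1)],
        axis=1,
    )

# One week of hourly samples: daily sine/cosine pair, hour-of-day one-hot, lags
x_sin, x_cos = cyclic_features(24 * 7, period=24)
hour_of_day = np.arange(24 * 7) % 24
x_onehot = categorical_features(hour_of_day, values=range(24))
x_lagged = unroll_timeseries(np.arange(10.0), n_lags=3)
```

Each row of `x_lagged` then holds the previous `n_lags` samples of the signal, with leading zeros where no past values exist.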

Feature Selection

The framework offers several threshold-based feature selection methods, which analyze the input and target features based on a certain criterion and then discard features with a low value of that criterion. A widely used criterion is the Pearson Correlation Coefficient, which detects linear correlations between features [4]. It is calculated for two features with samples \(x_{0,\dots ,N}, y_{0,\dots ,N}\) and mean values \(\bar{x}\) and \(\bar{y}\):

$$\begin{aligned} r_{x,y} = \frac{\sum _{i=0}^{N}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum _{i=0}^N (x_i - \bar{x})^2\sum _{i=0}^N(y_i - \bar{y})^2}} \end{aligned}$$
(5)

In addition to the Pearson Correlation Coefficient, the framework provides thresholds based on non-linear dependency detection coefficients such as the Maximal Information Coefficient (\(\textrm{MIC}\)) [17].
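A minimal sketch of such a threshold-based selection, assuming a NumPy feature matrix with one column per feature (the framework's own interface and the function name `select_by_pearson` are illustrative):

```python
import numpy as np

def select_by_pearson(X, y, threshold=0.3):
    """Keep columns of X whose absolute Pearson r with target y (Eq. 5) meets the threshold."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    r = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    keep = np.abs(r) >= threshold
    return X[:, keep], keep

# Illustrative data: one informative feature, one pure-noise feature
rng = np.random.default_rng(0)
n = 500
informative = rng.normal(size=n)
noise = rng.normal(size=n)
y = 2.0 * informative + 0.1 * rng.normal(size=n)
X = np.column_stack([informative, noise])

X_sel, mask = select_by_pearson(X, y, threshold=0.5)
```

With this data the informative column passes the threshold while the noise column is discarded.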

Transformation and Filtering Operations

To transform features, the framework implements the Box-Cox transformation as well as square root and inverse transformations. In addition, the framework provides filtering operations, which have been applied in timeseries prediction, for instance in [9]. Discrete-time filters can be implemented in Python through the functions of scipy.signal, which offers routines for calculating the coefficients of different types of digital filters. A digital filter of order N can be defined through its transfer function H(z) in direct form:

$$\begin{aligned} H(z) = \frac{\sum _{i=0}^{N} b_i z^{-i}}{\sum _{i=0}^{N} a_i z^{-i}} \end{aligned}$$
(6)

The filter coefficients \(a_i\) and \(b_i\) define the behavior of the filter. The framework implements topologies such as the Butterworth [9] or Chebyshev filter. In addition, an envelope detection filter was implemented for the demodulation of modulated signals. The direct form filter classes of the framework offer a simple option for extension: different architectures can be implemented by re-defining the method for coefficient calculation, which allows creating filters with different Finite Impulse Response (FIR) or Infinite Impulse Response (IIR) structures.
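The framework's own filter classes are not reproduced here; as a sketch, the underlying scipy.signal routines can be used directly to design a Butterworth low-pass filter and smooth a noisy daily cycle (all signal parameters below are illustrative):

```python
import numpy as np
from scipy import signal

# Design a low-pass Butterworth filter: order N = 4, cutoff at 0.2 of the
# Nyquist frequency. For hourly data this passes the daily cycle
# (normalized frequency 2/24 ~ 0.083) while attenuating faster noise.
b, a = signal.butter(N=4, Wn=0.2)  # coefficients b_i, a_i of the transfer function

# Apply the filter to a noisy daily cycle; filtfilt filters forward and
# backward, which avoids introducing a phase delay.
n = np.arange(24 * 14)  # two weeks of hourly samples
rng = np.random.default_rng(1)
x = np.sin(2 * np.pi * n / 24) + 0.5 * rng.normal(size=n.size)
x_smooth = signal.filtfilt(b, a, x)
```

Swapping `signal.butter` for `signal.cheby1` (with an additional ripple argument) yields a Chebyshev design with the same coefficient interface, which mirrors the extension mechanism described above.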

3 Case Study

The framework was demonstrated on a use case from energy systems modeling: energy demand prediction for a mixed office-campus building. A white-box prediction model was to be trained on existing measurement data provided by [18]. In the prediction of energy demand, different factors must be considered, including thermal characteristics and the behavior of the Heating, Ventilation and Air Conditioning (HVAC) system [13]. Additionally, building energy demand may depend on occupancy [8] or be subject to seasonal trends [19]. Many of these factors show non-linear or dynamic behavior, which makes them difficult to address through a purely linear model. Through feature engineering methods, these factors were to be incorporated into the data-driven model.

Data-driven Model

For the selected application, a data-driven model of the building energy demand was to be created. To demonstrate the effect of feature engineering, two models were trained on the existing measurement data: a basic regression model and a regression model with engineered features. The energy demand was measured from 05/2019 to 03/2020 with a sampling time of 1 h [18]. The feature set consisted of temperature data as well as registration data for lectures inside the building. In addition, time-based data such as day-night changes and public or university holidays were included. A linear regression architecture was selected for the energy demand prediction due to its simplicity and explainability as a white-box ML model; dynamic system behavior and the seasonality of the underlying system were to be incorporated through feature engineering. Finally, the implemented models were compared to a baseline neural network model.

The implemented workflow consisted of a combination of cyclic [19] and categorical features [16], which were used to model seasonal trends, as well as data smoothing [9] and dynamic timeseries unrolling [12]. Finally, feature selection using the Pearson Correlation Coefficient was applied, similar to the method applied by Chen et al. [4]. An overview of the implemented workflow is depicted in Fig. 1.

Fig. 1 Implemented workflow: the basic feature set is processed through Butterworth filtering of selected features and dynamic timeseries unrolling, extended with cyclic and categorical features, and reduced through feature selection based on the Pearson Correlation Coefficient, yielding the engineered feature set

The model training was performed with a train-test split of 0.8 and 5-fold cross-validation. For the model with engineered features, the parameters of the timeseries unrolling and feature selection steps were determined through a grid search based on the metrics Coefficient of Determination (\(R^2\)), Mean Squared Error (\(\textrm{MSE}\)) and Mean Absolute Percentage Error (\(\textrm{MAPE}\)).
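This training procedure can be sketched with scikit-learn's GridSearchCV. The framework's actual step names and parameter grids are not given here, so the pipeline below uses an identity placeholder where the feature engineering steps would go, and synthetic data in place of the measurements:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

# Placeholder pipeline: "features" stands in for the engineering steps
# (unrolling, selection), whose parameters would be searched over.
pipe = Pipeline([
    ("features", FunctionTransformer()),  # identity placeholder
    ("model", LinearRegression()),
])

# Synthetic stand-in for the measurement data
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=200)

# Train-test split of 0.8 and 5-fold cross-validated grid search
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.8, random_state=0)
search = GridSearchCV(
    pipe,
    param_grid={"model__fit_intercept": [True, False]},  # stand-in grid
    cv=5,
    scoring="r2",  # MSE/MAPE are available via other scoring strings
)
search.fit(X_tr, y_tr)
test_r2 = search.score(X_te, y_te)
```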

Experimental Results

The two models were trained on the measurement data and compared in terms of performance metrics \(R^2\), Coefficient of Variation of the Root Mean Square Error (\(\mathrm {CV{-}RMSE}\)) and \(\textrm{MAPE}\). Table 1 gives an overview of the performance metrics for the different models.
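The reported values themselves are given in Table 1; for reference, the three metrics can be computed as follows (a plain NumPy sketch with made-up example values, not the case study data):

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of Determination."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

def cv_rmse(y_true, y_pred):
    """RMSE normalized by the mean of the measurements."""
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    return rmse / np.mean(y_true)

def mape(y_true, y_pred):
    """Mean Absolute Percentage Error (requires non-zero measurements)."""
    return np.mean(np.abs((y_true - y_pred) / y_true))

# Illustrative values only
y_true = np.array([100.0, 120.0, 80.0, 110.0])
y_pred = np.array([98.0, 125.0, 78.0, 108.0])
```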

Table 1 Performance metrics

The performance metrics showed a significant improvement in prediction accuracy for the linear regression model through the engineered features. This observation could also be made from the timeseries analysis depicted in Fig. 2.

Fig. 2 Timeseries analysis for the linear regression models over a period of 25 days from the test set: consumption in kWh over time, comparing the measured values with the predictions of the linear regression model and the model with engineered features; the measured peaks are the highest, and peaks are smaller on weekends

The timeseries analysis showed improvements in the linear regression model especially for seasonal trends, such as the day-night or weekly behavior of the energy demand. This improvement was attributed to the introduced features: while the categorical features provided information about daily or weekly trends, the day-night behavior was modeled through the cyclic features. The delayed input features introduced through dynamic timeseries unrolling provided information about short-term changes in the energy demand.

The introduction of the additional features modeling the seasonality and dynamics of the energy demand showed a significant accuracy improvement for the linear regression model. The results suggest that this approach shows promise for improving the accuracy of explainable linear models and could furthermore be applied to non-linear methods such as neural networks or decision trees.

4 Conclusion

This work presents a Python framework for feature engineering that provides different methods through a standardized interface. The framework is based on scikit-learn, is distributed as a Python package, and offers classic feature engineering methods such as feature creation, feature expansion, feature selection, and transformation and filter operations. Through its defined interfaces, additional methods can be added with low effort. Finally, the framework is demonstrated in a case study on energy demand prediction, using a workflow created from a subset of the implemented methods for data-driven model creation.

Future Work

The current version of the framework offers many options for extension. For instance, additional feature engineering methods can be added using the provided interfaces. In addition, combinations of the implemented methods, such as the presented workflow, can be applied to prediction tasks in different use cases.