Introduction

Climate change (CC) [1, 2] has emerged as a global issue of paramount importance due to its devastating impact on all forms of life on Earth. It poses significant threats to biodiversity as well as to human health and food production [3], owing to increased heat and humidity [4]. Hence, the Intergovernmental Panel on Climate Change (IPCC) has called for urgent decisions to mitigate the consequences of CC [5]. Fossil fuels encompass hydrocarbons and their derivatives, including oil, coal, and gas, and they are one of the key drivers of CC [6]. Until now, fossil fuels have been a primary source of non-renewable energy. Unfortunately, their combustion releases greenhouse gases, including carbon dioxide, into the atmosphere. When sunlight penetrates the Earth's atmosphere, these emissions trap the resulting heat and keep it from escaping back to space, which leads to global warming in the long term. Global warming can seriously damage life on our planet.

Due to the aforementioned detrimental impacts of fossil fuels, about 197 countries at the 28th Conference of the Parties (COP28) agreed to gradually shift from traditional fossil fuels to renewable energy sources [7]. There has also been a concerted effort to harness renewable energy technologies, such as solar [8], wind [9], hydro [10], and geothermal energy [11], which offer cleaner and more sustainable options for energy production. Moreover, the field of renewable energy has received increasing political and research attention as a way to overcome CC and the energy crisis. Solar energy is regarded as one of the most ubiquitous renewable energy sources, as sunlight is available throughout the year. Solar energy has many uses [12], such as water heating [13], food drying [14], heating buildings [15], irrigation [16], water distillation [17] and waste recycling [18].

With the help of Photovoltaic (PV) systems [19, 20], solar irradiance is transformed into electricity. Solar energy is considered a clean energy source, as it does not produce greenhouse gas emissions or other air pollutants during electricity generation. However, the intermittent nature of solar irradiance and fluctuating weather conditions (humidity, cloud cover, temperature, wind speed, and wind direction) affect the operation, management, and stability of PV grids, which in turn affect the amount of electricity produced. The task of predicting future solar irradiance from historical observations, so as to maintain the availability of solar energy, is commonly known as Solar Irradiance Forecasting (SIF).

SIF is considered a challenging problem. In businesses and industries, major decisions, including production, purchasing, and marketing, can depend on forecasting. While solar irradiance is available, surplus energy is stored in batteries for later use; accurate irradiance forecasts therefore help provide continuous electricity from this stored energy during intervals when solar irradiance is absent. The forecast horizon [21] is the time period into the future over which the forecast is made, while the forecast resolution represents the scale or size of the time frame within the forecast horizon. There are several categories of forecast horizons [22, 23], as shown in Fig. 1, which also presents applications appropriate for each horizon.

Fig. 1 Categorization of forecast horizons

The domain of solar irradiance forecasting has garnered growing attention from many researchers over the past few decades. The literature outlines several methodologies that can be divided into four categories: deterministic [24], ML or statistical [25, 26], deep learning [27, 28], and hybrid methods [29, 30]. However, deterministic techniques have certain drawbacks when producing short-term forecasts [31]. Unlike deterministic methods, ML and DL algorithms are considered more accurate for predicting solar irradiance.

Short-term solar irradiance forecasting [32] is a challenging task that needs to be solved urgently to support the electricity production of many companies and factories. There is also the need for a clean, renewable source of energy that can protect our environment from pollution and global warming. These reasons motivate us to perform a comparative study investigating the accuracy of many existing forecasting methods. The primary contributions of our work are as follows:

  • This study investigates the performance of various deep learning algorithms employed for predicting short-term solar irradiance, namely the artificial neural network, Convolutional Neural Network (CNN), Long Short Term Memory (LSTM), recurrent neural network, temporal convolutional network, gated recurrent unit, echo state network, residual neural network, and the hybrid CNN-LSTM model.

  • Another comparison among several ML algorithms is conducted for solar irradiance forecasting, including linear regression, stochastic gradient descent regression, random forest, least absolute shrinkage and selection operator, gradient boosting regression, decision tree regression and K-nearest neighbor regression.

  • Grid Search Cross-Validation (GSCV) with fivefold cross validation is used for hyperparameter tuning of all ML and DL models.

  • The dataset was collected from Islamabad over five years, from 2015 to 2019, at hourly intervals using precise meteorological instruments.

  • Several statistical analyses are performed, using the Adjusted R2 score, NRMSE, MAD, MSE and MAE, to examine the performance of the different algorithms.

  • SHAP and LIME, two Explainable AI (XAI) techniques, are employed to interpret the models and explain why specific results are obtained.

The main organization of the paper is shown as follows. Sect. "Literature Review" introduces a literature review of the recent algorithms developed for predicting the solar irradiance. Sect. "Time series analysis" illustrates the time series analysis and the process flow that is adopted by all ML and DL models in this paper. Sect. "Materials and methods" presents the structure of nine DL algorithms in addition to seven ML algorithms. In Sect. "Results analysis and discussion", the results are covered and discussed. Lastly, the conclusions and recommendations for further research are offered in Sect. "Conclusions and future works".

Literature review

Solar irradiance prediction is a challenging task that has garnered the attention of many researchers. Various advanced methods based on ML and DL algorithms have demonstrated strong performance in predicting solar irradiance, with the aim of maximizing solar energy production. The Artificial Neural Network (ANN) is one of the most common DL models employed for SIF [33,34,35,36]. Integrating fuzzy logic with the ANN [37] improves forecasting accuracy by 3.2% compared to the standalone ANN; typically, the fuzzy logic component determines the correlation between features. Another study by Kumar and Kalavathi [38] showed that the ANN model outperforms the Adaptive Neuro-Fuzzy Inference System (ANFIS) model in predicting PV power output.

Moreover, Qing and Niu [39] proposed an hour-ahead SIF approach for desert locations, taking frequent dust storms into account. The MLP model outperforms other models, such as support vector regression, K-nearest neighbor, and decision tree regression, in terms of accuracy. The results of a backpropagation ANN [40] show that its PV irradiance predictions are more accurate than those of the Autoregressive Moving Average (ARMA) model. A comparative study [41] using six ML algorithms found that the ANN is the most effective and predictive of the six. Gairaa et al. [42] forecast hourly global solar irradiation for time horizons from h + 1 to h + 6 using two approaches: multiple linear regression and ANN models.

Long Short-Term Memory (LSTM) networks, a type of Recurrent Neural Network (RNN), can learn long-term dependencies while avoiding the vanishing gradient problem. LSTM [43] is another DL model proposed for the SIF problem and achieves better outcomes than support vector regression. Additionally, the LSTM model in [44] considers the interdependence between hours of the same day. The LSTM model in [45] performs hyperparameter tuning to find the optimal parameter values; further data is then added to assess how it affects the improved model. In [46], LSTM is enhanced to predict short-term and long-term solar irradiance on real-world data. Michael et al. [47] proposed a novel forecast model incorporating LSTM with Bayesian optimization and LSTM with a dropout structure. To apply the LSTM model's predictions to solar radiation estimation, K-means clustering and the red/blue ratio are used to classify image pixels as cloud or sky [48].

Researchers have adopted various other models for SIF, including the support vector machine [49], random forest [50], K-Nearest Neighbor [51], gated recurrent unit [52], echo state network [53], and temporal convolutional network [54]. In [55], deep learning models based on MLP, GRU, LSTM, CNN and CNN-LSTM architectures use monthly data to estimate solar radiation; the CNN outperforms the other models with an MSE for global horizontal irradiance of 12.68 W/m2. Hybridization of ML models offers a potent technique that exploits the strengths of various models. ALHM [56] is a hybrid forecast method that combines an ANN with a support vector machine and has proven more accurate for five-minute-ahead and daily forecasts.

Also, CNN has been combined with LSTM [57], such that the CNN extracts spatial features and the LSTM extracts temporal features from historical solar irradiance. Kumari and Toshniwal [58] proposed an LSTM-CNN model that uses LSTM to capture the temporal features of the solar irradiance time series and then CNN to capture the spatial features. Gao et al. [59] suggested integrating CNN-LSTM with CEEMDAN, which extracts features from the constitutive series in the historical data, in order to reliably predict solar irradiance. In [60], a hybrid CNN-LSTM-MLP model is combined with error correction and the VMD method to achieve one-hour-ahead solar irradiance forecasting. Also, Malakouti et al. [61] proposed a CNN-LSTM model for predicting wind power, which demonstrated more accurate results. Guermoui et al. [62] developed an enhanced hybrid model for multi-hour-ahead DNI forecasting. In [63], the adjusted Boland–Ridley–Lauret (BRL) model was applied on two time scales, daily and hourly.

Time series analysis

Time series analysis is a technique for analyzing data gathered over periods of time. In time series analysis, data comprises a set of observations or samples recorded at regular intervals throughout a predetermined time frame, as opposed to being recorded at random intervals. Time series analysis usually needs a lot of data points to guarantee consistency and reliability. Time series data is commonly used in forecasting, which predicts future data based on previous data. Figure 2 demonstrates the process flow and steps of analyzing the time series model for Global Horizontal Irradiance (GHI) data.

Fig. 2 The process flow of the time series model

After gathering the solar irradiance data from the photovoltaic grids, data preprocessing and feature extraction are performed to eliminate noise, zero, and null values from the data. The next step is data normalization, which eliminates biased findings and produces features on a comparable scale. The data is then divided into two separate sets: the training set and the testing set; the model is trained on the training set and its performance is checked on the testing set. In the final step, various statistical analyses are performed to investigate the performance of the trained models.

Materials and methods

Artificial Intelligence (AI) [64] encompasses methods that enable machines, especially computer systems, to imitate human intelligence in order to perform various tasks. ML is a part of AI that contains advanced techniques that allow machines to learn from data. Furthermore, DL is a branch of ML that uses neural networks for complex data processing. This section will introduce various ML algorithms employed for solving the SIF problem. Also, we will present the most famous algorithms used for tackling the problem in the field of DL. Figure 3 depicts a general view of AI, ML, and DL. Figure 4 exhibits the nine DL algorithms and seven ML algorithms that will be examined in our study.

Fig. 3 General view of artificial intelligence, machine learning, and deep learning

Fig. 4 Classification of solar irradiance forecasting models

Deep learning algorithms

This subsection delves into the most prominent deep learning models that are highly suggested for SIF, including the Artificial Neural Network (ANN) [65], Multilayer Perceptron (MLP) neural network [66], Convolutional Neural Network (CNN) [67], Recurrent Neural Network (RNN) [68], Long Short Term Memory (LSTM) [69], Temporal Convolutional Network (TCN) [70], Gated Recurrent Unit (GRU) [71], Echo State Network (ESN) [72], Residual Neural Network [73], and finally the hybrid CNN-LSTM model [74]. The main structure and details of these DL algorithms are explained below.

Artificial neural network

The ANN [75, 76] is a computational model consisting of linked neurons arranged into input, hidden, and output layers, designed to simulate the operations of the human brain, as seen in Fig. 5. It is highly recommended for recognizing patterns and forecasting tasks [77]. The input layer is mainly the first layer that receives initial data from the real world. ANN can contain one or more hidden layers. Data is passed from the input layer to the hidden layer, which processes it using learning methods and an activation function. Furthermore, the output layer is the last layer of ANN, consisting of a specific number of nodes representing the output values.

Fig. 5 Architecture of ANN
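For illustration, a minimal Keras sketch of such a feedforward ANN for GHI regression is given below; the number of hidden units is an illustrative assumption, not the tuned architecture reported later.

```python
# A minimal sketch (not the authors' exact architecture): a fully connected
# ANN with one hidden layer for regression on tabular weather features.
import tensorflow as tf

n_features = 9  # assumed inputs: DNI, DHI, T_amb, RH, WS, WS_gust, WD, WD_std, BP

ann = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(n_features,)),      # input layer
    tf.keras.layers.Dense(64, activation="relu"),    # hidden layer
    tf.keras.layers.Dense(1, activation="linear"),   # output layer (predicted GHI)
])
ann.compile(optimizer="adam", loss="mse")
```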

Convolutional neural network

Similar to other forms of ANNs, CNNs [67, 78] are composed of artificial neurons interconnected through weights and arranged in layers. They are typically designed to learn and extract features from visual data or images. Through the use of convolutional and pooling layers, CNNs can learn directly from raw pixel data. Applied to time series data, a CNN can effectively capture temporal relationships and extract important features by interpreting the incoming data as a sequence of one-dimensional signals: one-dimensional convolutions and pooling operations learn local patterns and detect temporal correlations within the data.

A CNN comprises several layers, such as convolution, pooling, flatten, and output layers. For clarity, Fig. 6 displays the architecture of a CNN. The convolution layer is usually called the feature extractor layer, since it extracts salient features from image or time-series data by sliding a filter over successive receptive fields of the data. The CNN employs the Rectified Linear Unit (ReLU) activation function, which sets all negative values to zero according to the following equation (see Fig. 7):

$$f(x) = \max (0,x)$$
(1)
Fig. 6 Architecture of CNN

Fig. 7 Graphical representation of the ReLU activation function

The pooling layer reduces the spatial volume of the data after the convolution layer. There are several pooling methods, such as average pooling, max pooling, and L2-norm pooling. Max pooling selects the highest value within the window, as can be seen in Fig. 8. A pooling layer can be placed after a convolution layer or between two convolution layers, but not after the fully connected layer, and it helps reduce the computational cost. The flatten layer sits between the convolutional part and the fully connected (ANN) part; it converts the CNN output into an input that the dense layers can process to learn complex patterns and make predictions. Finally, the output layer produces the final output.

Fig. 8 Max pooling technique
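For illustration, a minimal Keras sketch of a 1D CNN of this kind is given below, assuming input windows of 4 hourly time steps with 9 features; the filter counts and layer sizes are illustrative assumptions.

```python
# A minimal sketch of a 1D CNN for short-term irradiance forecasting.
import tensorflow as tf

window_size, n_features = 4, 9

cnn = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(window_size, n_features)),
    tf.keras.layers.Conv1D(filters=64, kernel_size=3, padding="same",
                           activation="relu"),        # feature extractor (ReLU, Eq. 1)
    tf.keras.layers.MaxPooling1D(pool_size=2),         # max pooling (Fig. 8)
    tf.keras.layers.Flatten(),                         # flatten layer
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1, activation="linear"),     # output layer
])
cnn.compile(optimizer="adam", loss="mse")
```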

Recurrent neural network

Unlike feedforward neural network architectures, the RNN [79] contains recurrent connections, so information flows forward through time and errors are propagated backward through time during training. RNNs are suggested for handling sequence and time series data [80], such as audio, sensor data [81], and text [82]. The RNN consists of three layers: an input layer (\(x\)), a recurrent hidden layer (\(h\)), and an output layer (\(O\)), as shown in Fig. 9. It can be modeled mathematically as follows [83]:

$$h_{t} = f_{RNN} (U \times x_{t} + W \times h_{t - 1} + b)$$
(2)
$$\hat{y}_{t} = \sigma (Vh_{t} )$$
(3)

where \(x_{t}\) and \(h_{t}\) are the input and the hidden state at the current time step \(t\). Furthermore, \(h_{t - 1}\) refers to the hidden state at the previous time step. The variable \(\hat{y}_{t}\) is the predicted output at time step \(t\). The parameters \(U\), \(W\), and \(V\) are the weight matrices for the input, hidden, and output layers, respectively. The symbol \(f_{RNN}\) denotes the activation function, \(b\) the bias term, and \(\sigma\) the sigmoid activation function. The feedback loop in the hidden layer enables the RNN to keep a memory of past information. This short-term memory allows the network to take previous inputs into account when producing an output and to reveal relationships between data points that are far apart (Table 1).

Fig. 9 Architecture of RNN cell

Table 1 Types of RNN

Table 1 lists various types of RNN according to the number of input/output layers and mentions the best-suited application for each type. RNNs are trained by applying backpropagation through time: the weights for the current and previous inputs are updated by propagating the error from the last time step back to the first one. An advantage of the RNN is that the model size does not increase with the size of the input. However, over long sequences the weight gradients vanish, shrinking toward zero; the network then stops learning and is not suitable for long-term dependencies.

Table 2 Specifications of the used weather data

Long short term memory

LSTM tackles the vanishing gradient problem faced by the RNN, since it relies on a gating mechanism that introduces input, output, and forget gates [83]. Figure 10 shows the LSTM cell structure. These gates control which information is retained throughout the network. The forget gate (\(f_{t}\)) discards irrelevant information from the LSTM cell at a given time step, as can be seen in Fig. 11. The input gate (\(i_{t}\)) adds and updates new information, while the output gate (\(O_{t}\)) returns the updated information. The mathematical model of the LSTM cell is defined as follows:

$$f_{t} = \sigma (w_{f} x_{t} + U_{f} h_{t - 1} + b_{f} )$$
(4)
$$\widehat{{c_{t} }} = \tanh (w_{c} x_{t} + U_{c} h_{t - 1} + b_{c} )$$
(5)
$$i_{t} = \sigma (w_{i} x_{t} + U_{i} h_{t - 1} + b_{i} )$$
(6)
$$c_{t} = f_{t} \odot c_{t - 1} + i_{t} \odot \hat{c}_{t}$$
(7)
$$O_{t} = \sigma (w_{0} x_{t} + U_{0} h_{t - 1} + b_{0} )$$
(8)
$$h_{t} = O_{t} \odot \tanh (c_{t} )$$
(9)

where the parameters \(w_{i}\), \(w_{c}\), \(w_{f}\), \(w_{o}\), \(U_{i}\), \(U_{f}\), \(U_{c}\) and \(U_{o}\) represent weight vectors. \(x_{t}\), \(h_{t}\) and \(c_{t}\) denote the input, hidden state and cell state at time step \(t\), while \(h_{t - 1}\) and \(c_{t - 1}\) denote the previous hidden state and cell state at time step \(t - 1\). Moreover, \(\hat{c}_{t}\) is the candidate memory, and the symbol \(\odot\) denotes element-wise multiplication. \(b_{f}\), \(b_{c}\), \(b_{o}\) and \(b_{i}\) are bias terms, and \(\sigma\) and \(\tanh\) refer to the sigmoid and \(\tanh\) activation functions.
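A minimal NumPy sketch of a single LSTM cell step following Eqs. (4)-(9) is given below; the weights and sizes are randomly initialized purely for illustration.

```python
# A minimal sketch of one LSTM cell step (Eqs. 4-9); not a trained model.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    # W, U, b are dicts keyed by gate name: "f", "i", "c", "o"
    f_t = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])    # Eq. (4) forget gate
    c_hat = np.tanh(W["c"] @ x_t + U["c"] @ h_prev + b["c"])  # Eq. (5) candidate memory
    i_t = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])    # Eq. (6) input gate
    c_t = f_t * c_prev + i_t * c_hat                          # Eq. (7) cell state
    o_t = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])    # Eq. (8) output gate
    h_t = o_t * np.tanh(c_t)                                  # Eq. (9) hidden state
    return h_t, c_t

n_in, n_hid = 9, 16
rng = np.random.default_rng(0)
W = {k: rng.normal(size=(n_hid, n_in)) for k in "fico"}
U = {k: rng.normal(size=(n_hid, n_hid)) for k in "fico"}
b = {k: np.zeros(n_hid) for k in "fico"}
h, c = lstm_step(rng.normal(size=n_in), np.zeros(n_hid), np.zeros(n_hid), W, U, b)
```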

Fig. 10 Architecture of LSTM cell

Fig. 11 The function of main LSTM gates

Gated recurrent unit

Like the LSTM, the GRU is a network that handles the vanishing gradient problem, in this case by introducing reset and update gates. Despite being similar to the LSTM, the GRU has a few advantages: since it has fewer parameters than the LSTM, its architecture is simpler and more compact. The structure of the GRU cell is given in Fig. 12. Its gates determine which information is retained throughout the network. The mathematical model of the GRU cell can be formulated as:

$$r_{t} = \sigma (w_{r} x_{t} + U_{r} h_{t - 1} + b_{r} )$$
(10)
$$z_{t} = \sigma (w_{z} x_{t} + U_{z} h_{t - 1} + b_{z} )$$
(11)
$$\widehat{h}_{t} = \tanh (w_{h} x_{t} + U_{h} h_{t - 1} + b_{h} )$$
(12)
$$h_{t} = z_{t} \odot h_{t - 1} + (1 - z_{t} ) \odot \widehat{h}_{t}$$
(13)

where \(r_{t}\) and \(z_{t}\) denote the reset gate and the update gate. \(w_{r}\), \(w_{z}\), \(w_{h}\), \(U_{r}\), \(U_{z}\) and \(U_{h}\) are the weight vectors for the input and hidden states, and \(b_{r}\), \(b_{z}\) and \(b_{h}\) are bias terms. \(h_{t}\) and \(\hat{h}_{t}\) are the hidden state and the candidate hidden state at time step \(t\), whereas \(h_{t - 1}\) is the previous hidden state. \(\sigma\) and \(\tanh\) are the sigmoid and \(\tanh\) activation functions.

Fig. 12 Architecture of GRU cell

Temporal convolutional network

TCNs [84, 85] are a powerful tool for handling sequence data and offer several benefits over traditional sequence models such as the RNN and LSTM. Figure 13 exhibits the architecture of the TCN. TCNs are parallelizable and avoid gradient issues (vanishing and exploding gradients). They can also handle sequences of varying lengths, both long and short, which makes them a versatile choice for many time series and sequence-based tasks. Unlike other convolutional networks, the TCN employs causal and dilated convolutions to handle the increased network depth required for longer inputs [86]; the dilation factor (d = 1, 2, 4) is typically doubled at each layer. TCNs use 1D convolutional layers, where each layer receives the outputs of the previous layers. The TCN also contains a residual block structure, similar to ResNet [87, 88], which provides skip connections to facilitate the training of deep networks and prevent vanishing gradients.

Fig. 13 Architecture of TCN
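For illustration, a minimal Keras sketch of the TCN idea (causal convolutions with dilation factors 1, 2, 4 and a residual skip connection) is given below; the filter counts are illustrative assumptions rather than the configuration used in the experiments.

```python
# A minimal sketch of stacked causal Conv1D layers with doubling dilation
# and a residual (skip) connection, the core building block of a TCN.
import tensorflow as tf

window_size, n_features = 4, 9
inputs = tf.keras.layers.Input(shape=(window_size, n_features))

x = inputs
for d in (1, 2, 4):                       # dilation factor doubled at each layer
    x = tf.keras.layers.Conv1D(32, kernel_size=2, padding="causal",
                               dilation_rate=d, activation="relu")(x)

# residual block: match channels with a 1x1 convolution, then add
shortcut = tf.keras.layers.Conv1D(32, kernel_size=1, padding="same")(inputs)
x = tf.keras.layers.Add()([x, shortcut])

x = tf.keras.layers.Flatten()(x)
outputs = tf.keras.layers.Dense(1)(x)
tcn = tf.keras.Model(inputs, outputs)
tcn.compile(optimizer="adam", loss="mse")
```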

Echo state networks

The ESN [89, 90] is a type of RNN. It consists of an input layer (\(u_{t}\)), a reservoir layer (\(w_{res}\)) and an output layer (\(y_{t}\)), as depicted in Fig. 14. The reservoir layer comprises a large number of recurrently connected nodes. In the figure, \(w_{in}\) refers to the connection weights between the input and hidden layers, and \(w_{out}\) indicates the connection weights between the hidden layer and the output layer. The connection weights between the input layer and the reservoir layer remain unchanged once they have been initialized [91]; only the output-layer weights are trained, which simplifies the training process and reduces computational complexity. The ESN is beneficial for time series forecasting because it can successfully extract temporal dynamics and nonlinear patterns from data.

Fig. 14 Architecture of ESN
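A minimal NumPy sketch of the ESN idea is given below: the input and reservoir weights are fixed after random initialization, and only the linear readout is trained (here with ridge regression on toy data); all sizes and the spectral radius are illustrative assumptions.

```python
# A minimal sketch of an echo state network: fixed random reservoir,
# trained linear readout. Toy data only.
import numpy as np

def run_reservoir(U_seq, n_res=100, spectral_radius=0.9, seed=0):
    rng = np.random.default_rng(seed)
    n_in = U_seq.shape[1]
    w_in = rng.uniform(-0.5, 0.5, size=(n_res, n_in))    # fixed input weights
    w_res = rng.uniform(-0.5, 0.5, size=(n_res, n_res))  # fixed reservoir weights
    w_res *= spectral_radius / np.max(np.abs(np.linalg.eigvals(w_res)))
    x = np.zeros(n_res)
    states = []
    for u_t in U_seq:                                     # update reservoir state
        x = np.tanh(w_in @ u_t + w_res @ x)
        states.append(x.copy())
    return np.array(states)

def train_readout(states, y, lam=1e-6):
    # only w_out is trained: ridge regression on the collected states
    return np.linalg.solve(states.T @ states + lam * np.eye(states.shape[1]),
                           states.T @ y)

U_seq = np.random.default_rng(1).normal(size=(200, 9))    # toy input sequence
y = np.random.default_rng(2).normal(size=200)             # toy targets
S = run_reservoir(U_seq)
w_out = train_readout(S, y)
y_pred = S @ w_out
```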

Residual neural network

Residual networks [92] are characterized by skip connections that jump over some layers (see Fig. 15). The residual connection applies an identity mapping to \(x\) and then performs the element-wise addition \(x + F(x)\), after which the ReLU activation function is applied to \(x + F(x)\). If \(x\) and \(F(x)\) have the same dimension, the element-wise addition is used directly; otherwise, the identity mapping is replaced by a linear transformation, giving \(w \cdot x + F(x)\), where \(w\) is a weight matrix. Skip connections allow gradients to flow directly through the network, without passing through the non-linear activation functions that cause neural networks to suffer from the vanishing gradient problem.

Fig. 15 Residual connection
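A minimal Keras sketch of a residual block implementing this idea is given below; the layer widths are illustrative assumptions.

```python
# A minimal sketch of y = ReLU(x + F(x)), with a linear projection w·x
# when the dimensions of x and F(x) differ.
import tensorflow as tf

def residual_block(x, units):
    # F(x): two dense layers
    f_x = tf.keras.layers.Dense(units, activation="relu")(x)
    f_x = tf.keras.layers.Dense(units)(f_x)
    # identity mapping, or a linear projection when dimensions differ
    if x.shape[-1] != units:
        x = tf.keras.layers.Dense(units, use_bias=False)(x)
    out = tf.keras.layers.Add()([x, f_x])           # element-wise addition x + F(x)
    return tf.keras.layers.Activation("relu")(out)  # ReLU applied after the addition

inputs = tf.keras.layers.Input(shape=(9,))
h = residual_block(inputs, 32)
outputs = tf.keras.layers.Dense(1)(h)
resnet = tf.keras.Model(inputs, outputs)
resnet.compile(optimizer="adam", loss="mse")
```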

CNN-LSTM

The CNN-LSTM [93] is a hybrid model that places CNN layers at the beginning of the network to extract relevant features from the input data. These features are then passed to subsequent LSTM layers that interpret them across time steps [94]. The architecture of the CNN-LSTM model is depicted in Fig. 16; it ends with a fully connected part containing a large number of neurons organized into input, dense, and output layers.

Fig. 16 Architecture of the hybrid CNN-LSTM model
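For illustration, a minimal Keras sketch following the layer ordering reported later in Fig. 27 (two convolution layers, dropout, pooling, two LSTM layers, one dense layer) is given below; the filter counts, unit sizes and exact layer order are illustrative assumptions, not the tuned model.

```python
# A minimal sketch of the hybrid CNN-LSTM: Conv1D layers extract local
# features from the input window, LSTM layers model them across time,
# and a dense head produces the GHI forecast.
import tensorflow as tf

window_size, n_features = 4, 9

cnn_lstm = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(window_size, n_features)),
    tf.keras.layers.Conv1D(64, kernel_size=3, padding="same", activation="relu"),
    tf.keras.layers.Conv1D(64, kernel_size=3, padding="same", activation="relu"),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.MaxPooling1D(pool_size=2),
    tf.keras.layers.LSTM(64, return_sequences=True),
    tf.keras.layers.LSTM(32),
    tf.keras.layers.Dense(1, activation="linear"),
])
cnn_lstm.compile(optimizer="adam", loss="mse")
```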

Machine learning algorithms

This subsection presents some of the most popular ML algorithms, namely linear regression [95], Stochastic Gradient Descent (SGD) regression [96], Least Absolute Shrinkage and Selection Operator (LASSO) [97], random forest [98], gradient boosting regression [99], decision tree regression [100] and K-Nearest Neighbor (KNN) regression [101]. These regression models are regarded as popular tools for statistical analyses and are explained below.

Linear regression

Linear regression [102, 103] is a supervised ML algorithm employed for tackling regression problems and estimating the relationship between two variables. It postulates a linear relationship between the independent variable (feature) and the dependent one (target), with the objective of finding the best-fit line that expresses this relationship. The line is found by minimizing the sum of squared differences between the real values and the predicted ones. The generalized formulation for linear regression is:

$$\hat{y} = X\beta + b$$
(14)
$$\beta = (X^{T} X)^{ - 1} X^{T} y$$
(15)
$$b = mean(y) - mean(X\beta )$$
(16)
$$\hat{y}_{new} = X_{new} \beta + b$$
(17)

where \(\hat{y}\) indicates the predicted value, \(X\) denotes the matrix of input features, \(\beta\) is the vector of weights or coefficients, \(b\) is the intercept term (bias), \(X_{new}\) refers to new data, and \(\hat{y}_{new}\) is the corresponding predicted value. The weight vector obtained for linear regression is \(\beta\) = [134.77742582, 135.5658808, 32.3798911, -13.60180839, 14.06928621, -18.5702262, -1.53548565, 46.40629438, 11.85316545], while the intercept value (\(b\)) is 315.6134357765278.
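A minimal NumPy sketch of Eqs. (14)-(17) on toy data is given below; it illustrates the closed-form solution only, not the fit reported above.

```python
# A minimal sketch of ordinary least squares via the normal equation.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 9))                    # 9 input features, toy data
y = X @ rng.normal(size=9) + 5.0 + rng.normal(scale=0.1, size=100)

beta = np.linalg.inv(X.T @ X) @ X.T @ y          # Eq. (15): coefficients
b = y.mean() - (X @ beta).mean()                 # Eq. (16): intercept

X_new = rng.normal(size=(3, 9))
y_new_hat = X_new @ beta + b                     # Eq. (17): prediction on new data
```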

Stochastic gradient descent regression

One of the supervised ML algorithms that may be used to tackle regression problems is SGD regression [104]. Iteratively updating the model weights using a small random portion of the training data, instead of the entire training set, makes SGD a common choice for large-scale regression tasks.

Least absolute shrinkage and selection operator

LASSO regression [105] is regarded as an extension of Ordinary Least Squares (OLS). With OLS, the estimates may suffer from high variance and lack of interpretability when a large number of predictors is present [106]; LASSO addresses these drawbacks and is used for regression problems requiring more accurate prediction. The main objective of LASSO regression is to find the coefficient values that minimize the sum of the squared differences between the actual and predicted values [107]. LASSO adds a penalty term to the OLS loss function equal to the sum of the absolute values of the weights associated with each feature variable. The loss function for LASSO regression can be expressed as:

$$LASSO_{loss} = OLS_{loss} + Penalty\_term = \sum\limits_{i = 1}^{m} {\left( {y_{i} - \sum\limits_{j = 1}^{p} {x_{ij} w_{j} } } \right)^{2} } + \alpha \sum\limits_{j = 1}^{p} {|w_{j} |}$$
(18)

where \(m\) and \(p\) refer to the number of observations and the dimensionality of the features, \(\alpha\) is a hyperparameter that controls the strength of the regularization, and \(w_{j}\) are the weights assigned by LASSO. The L1 regularization shrinks some weights exactly to zero while allowing other coefficients to take non-zero values, effectively performing feature selection. \(y_{i}\) denotes the \(i^{th}\) actual value and \(x_{ij}\) represents the value of feature \(j\) for the \(i^{th}\) observation. The LASSO weight vector obtained in our work is [134.78748341, 135.60866578, 32.32577104, −13.64606087, 12.66907409, −17.07588968, −1.32131259, 46.09668833, 11.72830064] and the LASSO intercept is 315.6143943704034.

Random forest

Random forest [108] is an ensemble technique that can solve both regression and classification problems using multiple decision trees. It applies the bagging technique, where each decision tree is trained on a random subset of the training data. It has gained significant popularity in recent years because of its impressive performance, simplicity of implementation, and modest processing requirements [109]. It deals with the bias-variance trade-off by reducing the variance of the prediction model.

Random forests [110, 111] involve creating multiple decision trees during training and then outputting the average prediction (in the case of regression) of the individual trees. Each tree is trained on a random subset of the data and features, which helps to reduce overfitting and improve generalization. The pooling of multiple trees helps stabilize predictions and increase accuracy. Additionally, this approach provides better interpretability and faster training times in many practical applications. The random forest approach has three parameters to adjust (a minimal sketch follows the list):

  • Number of estimators represents the number of trees. Increasing the number of estimators can lead to higher accuracy, but it is computationally expensive.

  • Maximum features indicate the number of features for making a split decision.

  • Maximum depth of a tree indicates how deep the tree can grow. The deeper the tree, the more complex it is.
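A minimal scikit-learn sketch of adjusting the three parameters above is given below; the parameter values and the toy data are illustrative assumptions.

```python
# A minimal sketch of a random forest regressor with the three tunable parameters.
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=200, n_features=9, noise=0.1, random_state=0)

rf = RandomForestRegressor(
    n_estimators=200,     # number of trees (estimators)
    max_features="sqrt",  # features considered at each split decision
    max_depth=10,         # maximum depth each tree can grow
    random_state=0,
)
rf.fit(X, y)
print(rf.predict(X[:5]))
```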

Gradient boosting regression

Gradient boosting regression [112] is a supervised ML algorithm based on an ensemble of decision trees. It seeks to minimize the loss function of the model by adding many weak learners via gradient descent. These decision trees are built in a greedy manner, with split points chosen to minimize the loss function.

Decision tree regression

Decision tree regression is an ML algorithm that predicts a continuous target by partitioning the feature space into regions and assigning a prediction to each region based on the feature values. It works by constructing a tree-based structure, recursively splitting the feature space on the input values. The leaves hold the target variable values, while the branches represent the combinations of input variables that lead to those values [113]. Unseen data points are predicted by traversing the tree and assigning the corresponding values.

KNN regression

KNN is another ML algorithm that is highly recommended for regression and classification tasks. The model constructs its prediction by finding the K data points nearest to a particular input and averaging the observations in that neighborhood (or choosing the majority class for classification) [26]. The distance between the neighbors and the given point can be calculated using the Euclidean distance. The Euclidean distance (ED) between two points \(x = (x_{1} ,x_{2} ,...,x_{d} )\) and \(y = (y_{1} ,y_{2} ,...,y_{d} )\) in a \(d\)-dimensional space is calculated as:

$$ED_{x,y} = \sqrt {\sum\limits_{j = 1}^{d} {(x_{j} - y_{j} )^{2} } }$$
(19)

The KNN prediction is built from the K neighbors, and each neighbor is assigned a weight based on its distance to the query point: the closer the neighbor, the higher its weight. The weight of the \(i^{th}\) neighbor (\(w_{i}\)) is given by:

$$w_{i} = \frac{1}{{ED_{{x,y_{i} }} }}$$
(20)

\(ED_{{x,y_{i} }}\) is the distance between the query point \(x\) and its \(i^{th}\) neighbor \(y_{i}\). The predicted value (\(\hat{y}\)) for the query point is the weighted average of the target values of its nearest neighbors that can be computed by:

$$\hat{y} = \frac{{\sum\limits_{k = 1}^{K} {w_{k} .y_{k} } }}{{\sum\limits_{k = 1}^{K} {w_{k} } }}$$
(21)

\(y_{k}\) is the target value of the \(k^{th}\) neighbor, where \(k = 1,2,...,K\).
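A minimal NumPy sketch of distance-weighted KNN regression following Eqs. (19)-(21) is given below; the toy data, K = 3 and the small epsilon guarding against division by zero are illustrative assumptions.

```python
# A minimal sketch of distance-weighted KNN regression (Eqs. 19-21).
import numpy as np

def knn_predict(X_train, y_train, x_query, K=3, eps=1e-12):
    dists = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))  # Eq. (19): Euclidean distance
    idx = np.argsort(dists)[:K]                               # K nearest neighbours
    w = 1.0 / (dists[idx] + eps)                              # Eq. (20): inverse-distance weights
    return (w * y_train[idx]).sum() / w.sum()                 # Eq. (21): weighted average

rng = np.random.default_rng(0)
X_train = rng.normal(size=(50, 9))
y_train = rng.normal(size=50)
print(knn_predict(X_train, y_train, rng.normal(size=9)))
```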

Interpretable results

Explainable Artificial Intelligence (XAI) allows users to understand the logic or reasoning behind an ML model's output. It aims to develop AI systems that are more transparent in their decision-making processes. The level of interpretability of a model directly correlates with how easily the underlying justifications for its judgments or predictions can be understood: one ML model is more interpretable than another if its explanations are easier for a person to understand than the other model's decisions. Understanding the reasoning behind a model's prediction is crucial for the widespread adoption of predictive models in many applications.

Shapley additive explanations

SHapley Additive exPlanations (SHAP) [114] utilizes the principles of game theory to elucidate the predictions of any ML model by quantifying the impact of each feature on the prediction output. The SHAP method identifies the primary features that influence the model's forecast. SHAP approximates the original model to a specific input and reduces the impact of missing features as follows:

$$g(z') = \varphi_{0} + \sum\limits_{j = 1}^{m} {\varphi_{j} z'_{j} }$$
(22)

where \(g\) represents the explanation model and \(z' \in \{ 0,1\}^{m}\) is a binary vector in which 1 indicates that a simplified feature matches the original and 0 indicates that it does not. Moreover, \(\varphi_{j}\) refers to the attribution effect of the \(j^{th}\) feature, with \(j = 1,...,m\), and \(m\) is the number of simplified features.
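For illustration, a minimal sketch of computing SHAP values with the shap package for a fitted tree-based regressor is given below; model and X_test are placeholders for a trained model and the test features.

```python
# A minimal sketch of SHAP explanations for a fitted tree-based regressor.
import shap

explainer = shap.TreeExplainer(model)          # `model` is a trained tree-based regressor
shap_values = explainer.shap_values(X_test)    # one row of SHAP values per test sample

# bar plot of mean |SHAP| per feature (as in Fig. 33) and beeswarm plot (as in Fig. 34)
shap.summary_plot(shap_values, X_test, plot_type="bar")
shap.summary_plot(shap_values, X_test)
```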

Local interpretable model-agnostic explanation

The Local Interpretable Model-Agnostic Explanation (LIME) [115] serves as an "explainer" that elucidates the prediction outcome for individual data samples. The LIME output consists of interpretations that depict the contribution of each feature to the prediction for a specific sample, thereby providing local interpretability.
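Similarly, a minimal sketch of a local LIME explanation for a single test sample is given below; X_train, X_test and model are placeholders, and the feature names follow the dataset description.

```python
# A minimal sketch of a local LIME explanation for one test sample.
import numpy as np
from lime.lime_tabular import LimeTabularExplainer

feature_names = ["DNI", "DHI", "T_amb", "RH", "WS", "WS_gust", "WD", "WD_std", "BP"]
explainer = LimeTabularExplainer(np.asarray(X_train),
                                 feature_names=feature_names,
                                 mode="regression")
exp = explainer.explain_instance(np.asarray(X_test)[0], model.predict, num_features=9)
print(exp.as_list())    # per-feature contributions for this sample (as in Fig. 35)
```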

Results analysis and discussion

This section explores the solar irradiance dataset and the results of the different ML and DL models on this data. SubSect. "Environment setup" presents the configuration of the experimental environment in which our tests are conducted. The characteristics of the solar irradiance dataset are outlined in subSect. "Dataset description", and subSect. "Dataset distribution" explains the distribution of the data. The preprocessing steps and feature selection are discussed in subSect. "Data preprocessing and feature selection". The division of the data into training, testing, and validation sets is illustrated in subSect. "Data split". SubSect. "Hyperparameters tuning" introduces the GSCV used to optimize the hyperparameters of the various ML and DL models, and subSect. "Evaluation metrics of algorithms for solar irradiance forecasting" presents the evaluation metrics used to check model performance. Finally, the results of the DL and ML models are discussed and statistically analyzed in subSect. "Results for deep learning models" and subSect. "Results for machine learning algorithms", respectively.

Environment setup

For a fair comparison, all algorithms are implemented and run on the same data, taken from the Kaggle website. All algorithms are trained on an NVIDIA Tesla P100 GPU with 16 GB of RAM and developed in a Python 3.10.12 environment.

Dataset description

This study utilizes global horizontal solar irradiance data, together with meteorological parameters, acquired in Islamabad (33.64°N, 72.98°E, 500 m above sea level). The data were gathered over five years, from 2015 to 2019, at hourly intervals using accurate meteorological devices. The weather data contains 14 columns and the dataset comprises 41,256 samples. The characteristics and specifications of the data are listed in Table 2.

Dataset distribution

To understand the distribution of data, we employ several statistical measures, including mean (\(Mean\)) and standard deviation (\(Std\)) that are calculated as in Eq. (23) and Eq. (24), respectively. Furthermore, box plots [116] are utilized to summarize data distributions, detect any skewness, identify outliers, and make comparisons between distributions. They offer five measures, including the minimum (Min), first quartile, median, third quartile, and maximum (Max) of the data values, that help to identify data spread and identify potential anomalies.

$$Mean = \frac{{\sum\limits_{i = 1}^{n} {x_{i} } }}{n}$$
(23)
$$Std = \sqrt {\frac{{\sum\limits_{i = 1}^{n} {(x_{i} - Mean)^{2} } }}{n}}$$
(24)

\(x_{i}\) is the \(i^{th}\) observation in data, where \(1 \le i \le n\). \(n\) indicates the number of observations. Table 3 introduces the statistical analysis of data distribution. From the table, the dataset has different distributions, where some variables have a wide range of values with significant variations, such as GHI and DNI, while other variables have smaller ranges and lower variations, such as T_amb, RH, WS, WS_gust, WD, WD_std, and BP. The variables dni_dev and cleaning have a substantial number of zero values.

Table 3 The statistical analysis on the solar irradiance dataset distribution

Principal Component Analysis (PCA) is a statistical technique used to reduce the dimensionality of a dataset while retaining the variation in the data. This is particularly useful for high-dimensional data, making it easier to visualize and analyze without losing significant information. From Fig. 17, it is evident that the cumulative explained variance plateaus after 9 components; the first 9 components already capture most of the variability in the data. Since our data originally contains 9 features, which is the total number of features in the original dataset, applying PCA would not provide any dimensionality reduction benefit.

Fig. 17 PCA reduction technique for the weather data

Furthermore, Fig. 18 displays the importance of each feature. A higher importance value means that the feature has a larger effect on the model used to predict the target value GHI; the input DHI values have a significant effect on the target GHI, with an importance value of 1.441757. Moreover, Fig. 19 shows the relationships between the features. Figure 20 depicts the entropy [117] of the data samples over time, which takes values between 0 and 1. A higher entropy value indicates a more uncertain or random distribution, whereas a lower entropy value corresponds to a more predictable or deterministic distribution. Inspecting the figure, we can see that most entropy values lie in the interval from 0.4 to 0.8. As the entropy approaches 1, the dataset becomes more randomly distributed, which makes predicting the solar irradiance more challenging.

Fig. 18 The importance of each feature of the weather data

Fig. 19 The relationship between each feature and the other features

Fig. 20 Entropy for the data samples over time

Data preprocessing and feature selection

Data preprocessing and feature selection ensure data quality, data normalization, and noise reduction. The data distribution shows that the dni_dev and cleaning features have many zero values; hence, dni_dev and cleaning are dropped from the data, and the column unnamed: 0 is also dropped, as it merely contains the sample index. Moreover, the contribution and correlation of these features to the output is weak, so deleting them is advantageous. In the Pearson correlation plot in Fig. 21, darker shades of red and blue indicate strong correlation, while lighter shades indicate weak correlation. Finally, the remaining features that are taken into consideration are GHI, DNI, DHI, T_amb, RH, WS, WS_gust, WD, WD_std and BP.

Fig. 21 The heatmap correlation of the features

The analysis reveals that a large number of GHI values, approximately 19,403, are equal to zero. These zero values correspond to night-time records, and they are zero owing to the absence of solar radiation. It is therefore advisable to remove the night-time data from 19:00 to 5:00 in the morning and only utilize data from the daylight period when solar radiation is present. After this step, the number of zero values drops to 4,236.
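A minimal pandas sketch of these cleaning steps is given below; the file name and timestamp column name are hypothetical placeholders.

```python
# A minimal sketch of the cleaning steps: drop weak columns and night-time records.
import pandas as pd

df = pd.read_csv("islamabad_weather.csv")                 # hypothetical file name

# drop weakly contributing columns and the index column
df = df.drop(columns=["dni_dev", "cleaning", "Unnamed: 0"], errors="ignore")

# keep only daylight hours (remove records from 19:00 to 05:00)
df["time"] = pd.to_datetime(df["time"])                   # hypothetical timestamp column
df = df[(df["time"].dt.hour >= 6) & (df["time"].dt.hour <= 18)]
```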

Since the features have very different distributions, with some having wide ranges and others small ranges, data normalization is crucial: it eliminates biased results, creates features on a comparable scale, enhances ML algorithm stability and convergence, improves model accuracy and generalizability by reducing the influence of outliers, and improves efficiency by minimizing complexity and storage needs. The standard scaler normalization (\(z\)) of the data is calculated as:

$$z = \frac{{x_{i} - Mean}}{Std}$$
(25)

where \(x_{i}\) refers to the \(i^{th}\) value to be normalized, and \(Mean\) and \(Std\) are computed as in Eq. (23) and Eq. (24), respectively.

From the statistical analysis of the data, the input features are DNI, DHI, T_amb, RH, WS, WS_gust, WD, WD_std and BP, which have a high correlation with the target (GHI), and a window size of 4 is used for the neural networks. The window size is crucial for effectively capturing local information, compressing information, extracting attributes at different scales, and achieving a balance between local and global information.
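A minimal sketch of the standard scaling of Eq. (25) followed by building sliding windows of size 4 is given below; df is assumed to be the cleaned dataframe from the previous step.

```python
# A minimal sketch of z-score scaling and sliding-window construction.
import numpy as np
from sklearn.preprocessing import StandardScaler

features = ["DNI", "DHI", "T_amb", "RH", "WS", "WS_gust", "WD", "WD_std", "BP"]
X = StandardScaler().fit_transform(df[features].values)   # z = (x - mean) / std, Eq. (25)
y = df["GHI"].values

window = 4
X_win = np.stack([X[i:i + window] for i in range(len(X) - window)])  # shape (samples, 4, 9)
y_win = y[window:]                                                    # next-step GHI target
```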

Data split

After the normalization step, the data split phase starts, as depicted in Fig. 22. For ML algorithms, the data is divided into training and testing sets for the standard training and evaluation process. However, for DL algorithms, it is common to split the data into training, testing, and validation sets. We use the training set to train the DL model, the testing set to assess its performance, and the validation set for hyperparameter tuning and model selection. This helps in choosing the best hyperparameters for the model and comparing the performance of different models due to the complexity of DL algorithms compared to ML ones. The validation set is unnecessary for ML algorithms because they have simple model architectures in contrast to DL models that have more complicated structures.

Fig. 22 Data split for DL and ML algorithms

For the DL algorithms, the data is divided into three sets: training, validation, and testing, with 70%, 15%, and 15% of the data, respectively, as adopted by many previous studies [43, 118,119,120]. For the ML approaches, the data is divided into two sets: training and testing, with 20,000 records used for training and 5,785 records for testing [36].

Hyperparameters tuning

GSCV [121, 122] is a powerful technique for optimizing the hyperparameters of DL models. Cross-validation is a crucial method for building better-fitting models by training and testing on all parts of the training dataset. Grid search with fivefold cross-validation [123, 124] is applied to the training and validation sets, while the testing set is kept aside for the final evaluation. The number of folds can be increased or decreased: a higher number of folds may lead to a less biased but more complex model, whereas a lower number of folds behaves more like a simple train-test split. Fivefold cross-validation [125, 126] has been adopted in many previous works. In GSCV, the training set is partitioned into five equal-sized folds; during the five iterations, one fold is used for validation while the remaining folds are used for training, and this process is repeated until every fold has been used for validation.
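A minimal scikit-learn sketch of GSCV with fivefold cross-validation is given below; the parameter grid and the choice of gradient boosting regression as the estimator are illustrative assumptions, and X_train, y_train stand for the training split.

```python
# A minimal sketch of grid search with fivefold cross-validation on the training set.
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [100, 200, 500],
    "learning_rate": [0.01, 0.05, 0.1],
    "max_depth": [3, 5, 7],
}
gscv = GridSearchCV(GradientBoostingRegressor(random_state=0), param_grid,
                    cv=5, scoring="neg_mean_squared_error")
gscv.fit(X_train, y_train)        # only the training data; the test set stays held out
print(gscv.best_params_)
best_model = gscv.best_estimator_
```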

Figure 23 depicts the used grid search with fivefold cross-validation for DL algorithms. For fair comparison among all DL algorithms, it has been taken into account that the models are trained and validated on the same parts of the data, as can be seen by the figure. Also, the grid search cross-validation is performed on the training set for better hyperparameters tuning, as can be illustrated in Fig. 24. Moreover, each algorithm is trained and tested on the same parts of the data, as in the figure.

Fig. 23 Grid search cross-validation for DL algorithms

Fig. 24 Grid search cross-validation for ML algorithms

Evaluation metrics of algorithms for solar irradiance forecasting

Various performance measures are employed to evaluate all the models, including the Mean Squared Error (MSE), R2 score, Adjusted R2 score, Median Absolute Deviation (MAD), Root Mean Squared Error (RMSE), Normalized Root Mean Squared Error (NRMSE) and Mean Absolute Error (MAE). MSE is a primary performance metric that measures the average of the squared errors and is computed by:

$$MSE = \frac{{\sum\limits_{i = 1}^{n} {(y_{i} - \widehat{y}_{i} )^{2} } }}{n}$$
(26)

\(n\) indicates the number of samples, \(y_{i}\) indicates the \(i^{th}\) actual value, where \(i = 1,...,n\), and \(\hat{y}_{i}\) represents the \(i^{th}\) predicted value. The R2 score is a statistical indicator of how closely the regression predictions match the actual data points: a value of 1.0 means that the predictions fit the actual data perfectly, and as the R2 score moves from one toward zero, the predictions move away from the actual data. The R2 score can be calculated by:

$$R^{2} (\% ) = \left( {1 - \frac{{\sum\limits_{i = 1}^{n} {(y_{i} - \hat{y}_{i} )^{2} } }}{{\sum\limits_{i = 1}^{n} {(y_{i} - \overline{y})^{2} } }}} \right) \times 100$$
(27)

\(\overline{y}\) represents the mean of all the actual values. The R2 score measures the overall fit of the model to the data and tends to increase when additional independent variables are added to the model, even if they have little or no explanatory power, which can lead to overfitting. The adjusted R2 score measures the goodness of fit while taking the number of predictors into account, thereby avoiding this issue. It can be computed as:

$$Adjusted\,R^{2} = 1 - \left[ {\frac{{(1 - R^{2} )(n - 1)}}{{(n - k - 1)}}} \right]$$
(28)

where \(k\) represents the number of independent variables in the model. RMSE is a common performance metric that measures the average difference between the predicted values and the real values; a lower RMSE indicates a superior model with more accurate predictions, i.e., a smaller difference between the real and predicted values. RMSE can be formulated mathematically as follows [127]:

$$RMSE(w/m^{2} ) = \sqrt {\frac{{\sum\limits_{i = 1}^{n} {(y_{i} - \widehat{{y_{i} }})^{2} } }}{n}}$$
(29)

The Normalized RMSE (NRMSE) is an extension of RMSE that is more suitable for data on different scales and can be calculated as follows:

$$NRMSE = \frac{RMSE}{{y_{\max } - y_{\min } }}$$
(30)

\(y_{\max }\) and \(y_{\min }\) are the maximum and minimum actual values. Another metric used to evaluate the models is the Median Absolute Deviation (MAD), defined as the median of the absolute differences between the observations (actual values) and the model outputs (predictions). It can be expressed as:

$$MAD = Median(|y_{i} - \hat{y}_{i} |)$$
(31)

\(Median\) is the midpoint of a data collection: half of the data points have values lower than or equal to it, and half have values higher than or equal to it. MAE is defined as the average absolute difference between the real and predicted values and is computed using Eq. (32) [127].

$$MAE(w/m^{2} ) = \frac{1}{n} \times \sum\limits_{i = 1}^{n} {|y_{i} - \widehat{{y_{i} }}|}$$
(32)
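A minimal NumPy sketch computing the metrics of Eqs. (26)-(32) is given below; y_true and y_pred stand for the actual and predicted GHI values, and k for the number of input features.

```python
# A minimal sketch of the evaluation metrics used in this study.
import numpy as np

def evaluate(y_true, y_pred, k):
    n = len(y_true)
    err = y_true - y_pred
    mse = np.mean(err ** 2)                                            # Eq. (26)
    r2 = 1 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2)  # Eq. (27)
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)                      # Eq. (28)
    rmse = np.sqrt(mse)                                                # Eq. (29)
    nrmse = rmse / (y_true.max() - y_true.min())                       # Eq. (30)
    mad = np.median(np.abs(err))                                       # Eq. (31)
    mae = np.mean(np.abs(err))                                         # Eq. (32)
    return {"MSE": mse, "AdjR2": adj_r2, "RMSE": rmse,
            "NRMSE": nrmse, "MAD": mad, "MAE": mae}
```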

Results for deep learning models

In this subsection, we investigate the performance of several DL algorithms for tackling the SIF problem. To ensure a fair comparison, all DL algorithms were trained under identical environmental conditions. We used the Keras API version 2.12.0, a Python framework for deep learning, to implement all the selected DL models; Keras is built on top of TensorFlow and offers a comprehensive set of tools for constructing and optimizing neural networks. The loss function for each model is the mean squared error, as shown in Eq. (26). The Adam optimizer is utilized for all DL models except the DNN model, which uses the RMSprop optimizer. Table 4 records the best learning rate values obtained by grid search cross-validation for each DL model. The batch size is set to 32, indicating the number of samples processed before the model is updated, and the window size is equal to 4. The architecture of the various DL models is given in Table 5.

Table 4 Learning rate for each DL model
Table 5 The architecture of different DL models

The number of epochs is set to 100, representing the maximum number of complete passes through the training dataset. Early stopping is triggered if a fault arises during training or if there is no improvement in the model's validation performance for a given number of epochs (\(patience = 5\)); the \(patience\) parameter is the number of epochs with no improvement. The kernel size for CNN and CNN-LSTM is 3. Additionally, the ReLU activation function is used in the hidden layers, while the linear activation function is used in the output layer. Furthermore, a dropout rate of 0.2 is used to prevent overfitting.
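A minimal Keras sketch of this common training configuration is given below; model, X_train, y_train, X_val and y_val are placeholders, and the learning rate shown is illustrative, since the tuned values are given in Table 4.

```python
# A minimal sketch of the shared training setup: MSE loss, Adam optimizer,
# batch size 32, up to 100 epochs, early stopping with patience 5.
import tensorflow as tf

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3), loss="mse")

early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                              restore_best_weights=True)
history = model.fit(X_train, y_train,
                    validation_data=(X_val, y_val),
                    epochs=100, batch_size=32,
                    callbacks=[early_stop])
```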

Table 6 records the results of the various DL models, including ANN, CNN, RNN, LSTM, GRU, TCN, ESN, Residual NN, MLP and CNN-LSTM. We employ several performance metrics to assess model quality: the number of parameters, the R2 score, the Adjusted R2 score, MSE, RMSE, NRMSE, MAE and MAD. The bold font in the table indicates the best results obtained. From the results, CNN-LSTM ranks first, achieving the maximum Adjusted R2 score of 0.984, which means its regression predictions are more closely aligned with the actual data than those of the other models. CNN-LSTM also obtains the minimum values of NRMSE = 0.036 and MSE = 1265.721, and the second-lowest values of MAE = 16.461 and MAD = 8.498. The CNN achieves the second rank with an Adjusted R2 score of 0.982. The TCN comes third with an Adjusted R2 score of 0.981 and has the largest number of parameters, 2,176,257. Moreover, RNN ranks fourth with an Adjusted R2 score of 0.978, followed by LSTM, Residual NN, GRU, ANN, MLP and ESN. The number of parameters affects the model's complexity; MLP has the fewest parameters, 701.

Table 6 The results of the DL models for solar irradiance forecasting

Figure 25 displays the rank of each model using five performance metrics: the Adjusted R2 score, MSE, NRMSE, MAE, and MAD. It can be seen that the CNN-LSTM model achieves the best results compared to its rivals, while the ESN has the worst performance. Figure 26 shows the sum of ranks over the five metrics to give an overall view of the model ranking for solar irradiance forecasting. The DL models rank from best to worst as follows: CNN-LSTM, TCN, CNN, RNN, GRU, LSTM, Residual NN, MLP, ANN and ESN.

Fig. 25 The rank of each DL model using adjusted R2 score, MSE, NRMSE, MAE and MAD metrics

Fig. 26 The sum of ranks of each DL model

Figure 27 describes the hybrid CNN-LSTM model for predicting solar irradiance: it contains two convolution layers, one dropout layer, one pooling layer, two LSTM layers and one dense layer. The CNN-LSTM model's training and validation loss curves are presented in Fig. 28. According to the figure, the CNN-LSTM runs for a different number of epochs on each fold because of the early stopping criterion; training is terminated when the training and validation losses stop improving during model fitting. The loss values of the training set and the validation set are in good agreement for all folds, suggesting that the model's predictions are in line with the actual data and that neither overfitting nor underfitting is present. Figure 29 depicts the predictions of the CNN-LSTM model compared with the actual values for the first 100 time steps.

Fig. 27 CNN-LSTM model summary

Fig. 28 The loss curve for training and validation set for all five folds of CNN-LSTM model

Fig. 29 CNN-LSTM predictions

Results for machine learning algorithms

This subsection examines the performance of several ML models: linear regression, SGD regression, LASSO, gradient boosting regression, random forest, decision tree regression, and KNN regression. Table 7 lists the best values of the hyperparameters associated with each ML model, found using the grid search cross-validation technique; it shows the candidate values of each parameter as well as the best values obtained. We used the scikit-learn API version 1.2.2, a Python framework for ML, to implement all the selected ML models.

Table 7 The best hyperparameters values found by grid search cross validation of each ML model

Table 8 presents a comparison among the ML algorithms: linear regression, SGD regression, LASSO, random forest, gradient boosting regression, decision tree regression and KNN regression. The table shows that gradient boosting regression achieves the best results on most of the performance metrics. It has the highest Adjusted R2 score, 0.962, which places it at the top of the algorithms, and it also obtains the minimum values of MSE = 3415.02, NRMSE = 0.058 and MAE = 29.33. KNN regression attains the second-highest Adjusted R2 score of 0.945, and linear regression ranks third with an Adjusted R2 score of 0.893.

Table 8 The results of ML algorithms

Figure 30 displays the ranking of the seven ML approaches according to five evaluation metrics: Adjusted R2 score, MSE, NRMSE, MAE, and MAD. Gradient boosting regression attains the first rank on all five metrics, while SGD shows the worst performance. Figure 31 displays the ranking of the algorithms based on the total sum of ranks over all performance metrics, where a lower sum of ranks indicates better performance. Gradient boosting regression has the minimum sum of ranks, with a value of 5, followed by KNN, while SGD occupies the last rank with a value of 26, demonstrating poor performance. Moreover, Fig. 32 depicts the predictions obtained by gradient boosting regression for the first 100 time steps, showing that its predicted values are closely aligned with the actual values.

Fig. 30 The rank of each ML model using Adjusted R2 score, MSE, NRMSE, MAE and MAD metrics

Fig. 31 The sum of ranks of each ML model

Fig. 32 Gradient boosting regression predictions

To explain the previous results obtained by the ML models and the outperformance of gradient boosting regression, XAI with SHAP and LIME is introduced. Mean absolute SHAP values are often shown as bar plots that order the features according to their importance, as seen in Fig. 33. The ordering of the features and the respective magnitudes of the mean absolute SHAP values are the most important factors to consider. Here, DHI is the most influential feature contributing to the GHI output, whereas WS_gust is the least informative.

Fig. 33 Mean absolute values of the SHAP for all features

The bar plot in Fig. 33 does not show how the underlying values of each feature relate to the model's predictions. Figure 34 illustrates how the features contribute to the model's predictions: each feature is represented by a row in the plot, and the horizontal x-axis indicates the SHAP values. Blue refers to lower values of a feature, whereas red refers to higher values. A feature with a wide spread of SHAP values has a greater contribution to the model's predictions, while feature values assembled around zero have a minimal impact. From the plot, the lower values of DHI have negative SHAP values, whereas the higher values of DHI have positive SHAP values; DHI also has a wide spread of SHAP values, reflecting its significant effect on the model's predictions. The LIME plot in Fig. 35 shows that DHI and DNI have the greatest impact on the output.

Fig. 34 Summary of beeswarm plot ranked by mean absolute SHAP value

Fig. 35 LIME plot

Conclusions and future works

Various DL and ML algorithms were employed for SIF. This paper presented a comparison of the most common DL algorithms, namely ANN, CNN, LSTM, GRU, RNN, TCN, ESN, CNN-LSTM, MLP and residual NN. Another experiment compared seven existing ML algorithms: linear regression, SGD regression, LASSO, random forest, gradient boosting regression, decision tree regression and KNN regression. GSCV was employed to find the best hyperparameter values for all DL and ML models. The dataset was gathered from Islamabad over five years at hourly intervals using accurate meteorological instruments. Moreover, we measured the effectiveness of each model using various metrics such as MSE, Adjusted R2 score, RMSE, MAE and MAD. Additionally, SHAP and LIME were used to provide an explanation and understanding of the results of the best models. For the deep learning algorithms, the statistical analysis shows that CNN-LSTM outperforms its counterparts, whereas gradient boosting regression achieves superior results compared to the other ML models.

Hybridization can be a powerful tool, as hybrid models can achieve better results than individual ones; hence, in the future, we hope to develop a hybrid model for solar irradiance forecasting. Hyperparameter tuning strategies seek to find the parameter values that improve the performance and learning process of ML algorithms, and integrating metaheuristic techniques to optimize these values is another future direction [128, 129]. Further possible extensions revolve mainly around different time horizons: short-term, medium-term, and long-term, to study the effectiveness of the proposed models. Finally, while this study focuses on solar irradiance forecasting, we aim to investigate other challenging problems, such as wind forecasting [130], PV power forecasting [131] and price forecasting [132].