1 Introduction

A common criticism leveled against models in modern machine learning is that their complexity makes it difficult to understand the reasons for their decisions. This is especially problematic when applying these models in domains where decisions must be carefully justified, such as healthcare and judiciary decision making (Kelly et al. 2020), making explainability an important tool when developing trustworthy AI. Another key factor in the development of trustworthy models is uncertainty quantification: the ability of models to give accurate assessments of the uncertainty inherent in their predictions. Although both explainability and uncertainty quantification are vital for the use of machine learning in high-stakes applications, remarkably little work has been done at their intersection.

Traditional explainable machine learning seeks to reveal the reasons for a model’s output or the reasons for its level of accuracy. These remain important pieces of information for understanding uncertainty-aware models, though equally important is the ability to determine the sources of uncertainty: the features that cause the model to be more or less confident in its prediction. If we can understand how different features contribute to uncertainty estimates, we can engineer systems capable of better decision making. As an illustrative example, say we have a model which predicts disease activity for a patient. The model may predict a low likelihood of disease activity, but with a high associated uncertainty. Understanding the source of this uncertainty is useful in determining how to handle the model prediction; if the uncertainty is due to an unusual value of a single feature, which is known to have high sensitivity and low specificity, it may be sensible to trust the model prediction. However, if the uncertainty is due to interactions between multiple features, this may be more indicative of an out-of-distribution case in need of further exploration.

Although there has been some recent work exploring this area for Bayesian neural networks (Depeweg et al. 2017; Chai 2018; Antoran et al. 2021), to the best of our knowledge, there has been only one recent attempt at a truly model-agnostic approach (Watson et al. 2023).

In this paper, we demonstrate that a range of existing simple techniques for explaining model output and performance can be modified to explain both the uncertainty and likelihood of the predictive distributions of uncertainty-aware models. In particular, we introduce novel adaptations of permutation feature importance (PFI), partial dependence plots (PDP), and individual conditional expectation (ICE) plots to explain how each feature available to a model affects its predictive distribution. We explore two complementary approaches: one looking at feature importance for the negative log-likelihood (loss), the other looking at the predictive uncertainty.

For the first approach, we introduce Likelihood-PFI, Likelihood-PDP, and Likelihood-ICE: likelihood-based variants of the methods listed above, which allow feature importance to be measured in terms of the effect a feature has on the model’s negative log-likelihood. Whereas conventional PFI and PDP/ICE explain the role that given features play in determining a model’s loss and output, respectively, our variants explain the role of those features in the likelihood that the model assigns to observed values of the target variable.

For the second approach, we measure the uncertainty of the model’s predictive distribution through the entropy of that distribution, introducing Entropy-PFI, Entropy-PDP, and Entropy-ICE. These variants examine the role of features in determining a model’s uncertainty, and can be used to reveal how features may share information and how doing so may affect a model’s confidence in its predictions.

We examine the properties of both sets of empirical measures through the use of carefully constructed synthetic datasets, and demonstrate how they can be used in a model-agnostic way to derive insights from models trained on real-world datasets in both classification and regression settings.

In recent years, methods such as those listed above have faced scrutiny due to the potential for misleading results in the presence of statistical dependencies between features (Hooker et al. 2021). These results are due to the fact that permutation-based methods break dependencies, which can force the model to extrapolate. While this extrapolation can be an issue for traditional permutation-based methods, we will show that this is not the case for Entropy-PFI; instead, the effect that these dependencies have on model confidence is a critical component of what is being measured. From this interpretation, measuring Entropy-PFI alongside Likelihood-PFI can help to detect issues when dealing with feature dependencies, while not requiring training of additional models.

The remainder of the paper is structured as follows. In Sect. 2, we briefly review the fields of uncertainty quantification and feature importance, as well as give a more in-depth introduction to the main feature importance methods of interest in this work. In Sect. 3, we introduce our novel likelihood- and entropy-based permutation feature importance methods and demonstrate their properties through experiments on synthetic datasets. In Sect. 4, we extend these ideas to partial dependence and individual conditional expectation plots for entropy and likelihood. In Sect. 5, we demonstrate how these techniques can be used to gain new insights on real-world datasets. Finally, in Sect. 6, we summarise our findings and give suggestions for future work.

2 Background

This section briefly reviews the fields of uncertainty quantification and feature importance, before introducing notation and describing some of the key methods in more detail.

2.1 Uncertainty quantification

Producing accurate measures of uncertainty requires that we are able to construct models that output a distribution over possible outcomes, as opposed to a single value deemed most likely by the model. To this end, several approaches have been proposed to create uncertainty-aware models, such as Gaussian processes (Williams and Rasmussen 2006) and Bayesian neural networks. For Bayesian neural networks, several approaches have been developed, including fully Bayesian networks (Neal 2012), approximations such as Monte Carlo dropout (Gal and Ghahramani 2016), along with a range of calibration methods (Guo et al. 2017). These uncertainty-aware models have their own unique strengths and weaknesses, although they all raise a set of common questions: Can we identify the sources of uncertainty for a model? And how do the features that increase model confidence differ from those that increase model performance? That is, how do we explain the uncertainty of a model?

Much previous work on explaining uncertainty revolves around decomposing the uncertainty into two types: epistemic uncertainty, i.e., uncertainty due to the finite amount of data available and limitations of the model; and aleatoric uncertainty, i.e., uncertainty that is inherent in the system that we are observing, which cannot be reduced by collecting more data (Depeweg et al. 2018). Epistemic uncertainty can be further decomposed into uncertainty about the suitability of the chosen model (structural uncertainty) and uncertainty in the choice of parameters given the specification of the chosen model (Liu et al. 2019). Various efforts have been made to identify how much uncertainty comes from each source: explicitly modelling aleatoric uncertainty via a noise parameter, estimating the role of epistemic uncertainty by modelling the density of the data in latent spaces (Mukhoti et al. 2023), and estimating aleatoric uncertainty by reducing epistemic uncertainty via ensembling (Shaker and Hüllermeier 2020). Other work critically examines the validity of decomposing uncertainty in this way (Wimmer et al. 2023).

However, we are interested in more fine-grained explanations for the sources of uncertainty: explaining how the particular values of given features in an example affect the uncertainty of the model output. Little work has been done in this area, although there are notable exceptions. Depeweg et al. (2017) examined the role of features in the uncertainty of Bayesian neural networks through sensitivity analysis: examining the gradient of the uncertainty with respect to each feature in turn. Antoran et al. (2021) look at how uncertainty estimates can be explained in Bayesian neural networks via counterfactuals: identifying which constellations of features are responsible for uncertainty by finding sets of minimal changes required to increase a model’s confidence in its prediction. Similarly, Chai (2018) and Zhang et al. (2022) look at the importance of features through predictive difference, showing changes in the predictive distribution of Bayesian neural networks when features are replaced with non-informative features modelled from a conditional distribution. More recently, Watson et al. (2023) have considered how Shapley values could be used to explain both aleatoric and epistemic uncertainty in a completely model-agnostic manner. This allowed for local explanations of the behaviour of model uncertainty.

Our work shares the same motivation as these works: to explain not just the predictions of these models, but also the uncertainty associated with those predictions. Unlike Depeweg et al. (2017), Chai (2018), Zhang et al. (2022) and Antoran et al. (2021), we do not assume a particular structure for our model and data. Although similar in spirit to the work of Watson et al. (2023), our work uses techniques that are conceptually and computationally simpler, while still able to offer insight into model behaviour.

2.2 Feature importance

Early feature importance methods were developed in an ad hoc way to describe models of interest, rather than being the subjects of research in their own right. Permutation feature importance (PFI) and partial dependence plots (PDPs) were first introduced in articles on random forests (Breiman 2001) and gradient boosting (Friedman 2001), respectively. Although in both cases the advantages and disadvantages of the feature importance measures were discussed, such a discussion was secondary to the main objectives of the article. Despite this, both methods have been widely adopted, adapted, and extended in the literature. Most notably for our purposes, Moosbauer et al. (2021) introduced the idea of adding confidence bands to PDPs, measuring the uncertainty associated with the cost function when applying PDPs in Bayesian optimisation.

Although PFI and PDPs were introduced independently, they share many common features. Implicit in both is the assumption that we can break the dependence between the feature of interest and the target variable by sampling from that feature’s marginal distribution without generating out-of-distribution samples. However, when features are statistically dependent this assumption is not reasonable, as was indeed observed in different respects in the original papers introducing the two measures (Breiman 2001; Friedman 2001).

These shortcomings have been addressed for PFI using conditional methods (Strobl et al. 2008; Molnar et al. 2023). In Strobl et al. (2008), the issue is resolved for random forests by taking advantage of the tree structure to break the feature space into a grid, with permutations carried out only between examples that fall within a common region, allowing for an approximation of the conditional distribution. The issue is addressed in a model-agnostic way in Molnar et al. (2023): an additional model is used to split the test set into sub-groups within which the feature of interest is conditionally independent of the other features, and PFI is then performed with permutations restricted to swapping values only between examples within the same sub-group. An alternative approach is to compare the model of interest with a second model in which the feature of interest has been completely removed, or in which the information in that feature has been destroyed by permutation in the training set (Hooker et al. 2021). These approaches are referred to as remove-and-relearn and permute-and-relearn, respectively. They resolve many of the issues that permutation-based methods face, but do so at greater computational cost, requiring at least one more model to be trained for each feature of interest.

For PDP, the issue of extrapolation can be partially resolved through the use of individual conditional expectation (ICE) plots (Goldstein et al. 2015). These de-aggregate the effects of individual test examples on PDP, allowing users to see how examples differ in their sensitivity to a particular feature. In this way, ICE plots can be used to reveal heterogeneity in the model behaviour for a given feature.

Another popular approach for local explanations is local interpretable model-agnostic explanations (LIME) (Ribeiro et al. 2016), in which a simple, easy-to-interpret model is used to approximate the more complex model of interest within a small region of interest. This local model is trained explicitly so that it is locally a good approximation of the target model, and therefore can be analysed to determine what factors affected the model’s decision at that point.

The SHAP algorithm is a method to derive point-wise explanations for a model’s output in a principled manner (Lundberg and Lee 2017). SHAP is derived from the game-theoretic notion of Shapley values, and is provably unique in satisfying a specific set of desirable properties (Lundberg and Lee 2017). Although SHAP values are theoretically sound, they suffer from problems of tractability. Several approximations are suggested by Lundberg and Lee (2017), and there are model-specific variants that allow for more accurate calculation (Lundberg et al. 2020), as well as model-specific variants incorporating uncertainty (Chau et al. 2024). A detailed discussion on the ways in which Shapley values can be computed and estimated can be found in Chen et al. (2023). Like permutation-based methods, both LIME and SHAP suffer when forced to extrapolate. For these methods, this difficulty comes in the form of adversarial attacks, which have been shown to allow a malicious actor to create false/misleading explanations by creating perturbed examples that are separate from the true data distribution (Slack et al. 2020).

Recently, a common framework unifying many of the methods described above, amongst others, was developed under the paradigm of “explaining by removing” (Covert et al. 2021). Using this framework, Covert et al. (2021) conducted a systematic exploration of the connections and differences between existing model explanation methods.

2.3 Permutation-based feature importance methods

In this section, we describe the key feature importance methods from the literature in more detail. These methods provide a tool for explaining how much a particular feature’s value is responsible for determining the performance of a model.

2.3.1 Permutation feature importance (PFI)

Define the random variables for the feature vector and the target as X and Y, respectively, and let \(P_{XY}\) denote the true joint distribution of the data, with \((X, Y) \sim P_{XY}\). Furthermore, let \(P_X\) denote the marginal distribution of X. We write \(X = (X_1, \ldots , X_d)\), with \(X_j\) being the random variable corresponding to the j-th feature, and denote its marginal distribution \(P_{X_j}\). Furthermore, we define \(X_{-j}\) as the vector of features with the j-th element omitted, that is, \(X_{-j}=(X_1, \ldots , X_{j-1}, X_{j+1},\ldots ,X_d)\), similarly denoting its joint distribution as \(P_{X_{-j}}\). Throughout, we use the convention that i indexes examples in a test set (with \(1 \le i \le n\)), j indexes features in a feature vector (\(1 \le j \le d\)) and c indexes class labels (\(1 \le c \le k\)).

PFI works as follows: given a trained model and a set of test examples \(\{(\textbf{x}^{(1)}, y^{(1)}),\ldots , (\textbf{x}^{(n)}, y^{(n)})\}\), with each \(\textbf{x}^{(i)} \in \mathbb {R}^d\), we construct a design matrix in \(\mathbb {R}^{n \times d}\), where the i-th row is the transpose of \(\textbf{x}^{(i)}\). To obtain the PFI measurement for the j-th feature, we randomly permute the j-th column and use the rows with that feature permuted as the feature vectors for our new test set. We refer to this new test set with the notation \(\{(\widetilde{\textbf{x}}^{(1)}, y^{(1)}),\ldots ,(\widetilde{\textbf{x}}^{(n)}, y^{(n)})\}\). With a cost function \(\mathcal {C}\), the empirical PFI for this test set is given as

$$\begin{aligned} \widehat{\text {PFI}}_\mathcal {C}(j) = \frac{1}{n}\sum _{i=1}^n \mathcal {C}(y^{(i)}, f(\textbf{x}_{-j}^{(i)}, \widetilde{x}_j^{(i)})) - \mathcal {C}(y^{(i)}, f(\textbf{x}^{(i)})). \end{aligned}$$
(1)

Here, we allow ourselves to use the convention adapted from Casalicchio et al. (2018) of writing \(f(\textbf{x}_{-j}, \widetilde{x}_j)\) to mean the function f with an input vector where the j-th entry is \(\widetilde{x}_j\) (i.e., the j-th entry of \(\widetilde{\textbf{x}}\)) and the other entries are populated using the entries of \(\textbf{x}_{-j}\). We may think of the PFI for the j-th feature as the difference in performance when the dependence of the model on the j-th feature is broken by permuting the j-th feature in the test examples.
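To make the estimator in Eq. (1) concrete, the following minimal NumPy sketch computes the empirical PFI for a generic model and cost function; the `model.predict` interface and the squared-error cost in the usage comment are illustrative assumptions rather than part of the definition above.

```python
import numpy as np

def permutation_feature_importance(model, X, y, j, cost, rng=None):
    """Empirical PFI (Eq. 1): mean cost with column j permuted minus
    mean cost on the original test set."""
    rng = np.random.default_rng(rng)
    base = cost(y, model.predict(X))                # C(y^(i), f(x^(i)))
    X_perm = X.copy()
    X_perm[:, j] = rng.permutation(X_perm[:, j])    # break dependence on feature j
    permuted = cost(y, model.predict(X_perm))       # C(y^(i), f(x_{-j}^(i), x~_j^(i)))
    return float(np.mean(permuted - base))

# Illustrative usage with a squared-error cost:
# pfi_j = permutation_feature_importance(model, X_test, y_test, j=0,
#                                        cost=lambda y, f: (y - f) ** 2)
```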

The values obtained by PFI are specific to the test set considered. However, they are Monte Carlo approximations of a quantity defined in terms of the data distribution (Casalicchio et al. 2018). Consider \((X, Y) \sim P_{XY}\), with \({X_{-j}}\) denoting the features of X with the j-th omitted, as before. Let \(\widetilde{X}_j\) have the same marginal distribution as \(X_j\) but be independent of X and Y. We have that

$$\begin{aligned} \text {PFI}_\mathcal {C}(j)&= \mathbb {E}_{{X_{-j}}, \widetilde{X}_j, Y}[\mathcal {C}(Y, f({X_{-j}}, \widetilde{X}_j))] - \mathbb {E}_{X, Y}[\mathcal {C}(Y, f(X))]\nonumber \\&\approx \frac{1}{n} \sum _{i=1}^n \mathcal {C}(y^{(i)}, f(\textbf{x}_{-j}^{(i)}, \widetilde{x}_j^{(i)} )) - \mathcal {C}(y^{(i)}, f(\textbf{x}^{(i)}))\nonumber \\&= \widehat{\text {PFI}}_\mathcal {C}(j). \end{aligned}$$
(2)

A common modification of PFI is to condition the distribution of the feature of interest on the observed values of the other features. This way, conditional PFI (CPFI) is defined as

$$\begin{aligned} \text {CPFI}_\mathcal {C}(j) = \mathbb {E}_{{X_{-j}}, (\widetilde{X}^c_j|{X_{-j}}), Y}[\mathcal {C}(Y, f({X_{-j}}, \widetilde{X}^c_j))] - \mathbb {E}_{X, Y}[\mathcal {C}(Y, f(X))], \end{aligned}$$

where \(\widetilde{X}^c_j\) is constructed such that it follows the conditional distribution of \(X_j|{X_{-j}}\) but \(\widetilde{X}^c_j|{X_{-j}}\) is independent of Y (Strobl et al. 2008).

2.3.2 Partial dependence plots (PDPs)

While PFI measures the effect of a given feature on the model’s performance relative to ground-truth labels, a PDP visualises a feature’s effect on the model output itself. In a PDP, a single feature is held at a fixed value while all other features jointly assume the values they take in a test example; the model output is then averaged across the test set. For feature j, the PDP is found by plotting \(\widehat{\text {PDP}}(x_j; j)\) for all values \(x_j\), where \(\widehat{\text {PDP}}\) is defined as

$$\begin{aligned} \widehat{\text {PDP}}(x_j; j) = \frac{1}{n} \sum _{i=1}^n f(\textbf{x}_{-j}^{(i)}, x_j). \end{aligned}$$
(3)

This is a Monte-Carlo approximation of the true value of interest, that being

$$\begin{aligned} \text {PDP}(x_j; j) = \mathbb {E}_{X_{-j}}[f({X_{-j}}, x_j)]. \end{aligned}$$

The value of a PDP at point \(x_j\) can be thought of as the answer to the following question: given that all features except the j-th follow the joint data distribution, what is the expected value of the model output when the j-th feature is fixed to the value \(x_j\)? Note that, as with PFI, computing this value can force the model to extrapolate to feature combinations that do not occur in the true data distribution.
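As a concrete illustration of the Monte Carlo estimate in Eq. (3), the sketch below evaluates the empirical PDP on a grid of values for the feature of interest; the grid construction and the `model.predict` interface are assumptions made for illustration.

```python
import numpy as np

def partial_dependence(model, X, j, grid):
    """Empirical PDP (Eq. 3): for each value x_j in `grid`, fix feature j to
    that value for every test example and average the model output."""
    pdp = np.empty(len(grid))
    for k, x_j in enumerate(grid):
        X_fixed = X.copy()
        X_fixed[:, j] = x_j
        pdp[k] = model.predict(X_fixed).mean()
    return pdp

# grid = np.linspace(X_test[:, j].min(), X_test[:, j].max(), num=50)
# pdp_curve = partial_dependence(model, X_test, j, grid)
```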

An example of the kinds of curve produced by PDP is shown in Fig. 1. In this demonstration, we show the PDP curve for a single hidden layer multi-layer perceptron (MLP) on the diabetes dataset (Smith et al. 1988).

2.3.3 Individual conditional expectations (ICEs)

A PDP is most effective when the features are independent of one another; as shown by Goldstein et al. (2015), a PDP can hide the real effects of varying a feature by considering only the average over the training distribution, rather than examining the effects on individual examples. To this end, Goldstein et al. (2015) introduced individual conditional expectation (ICE) plots, which show how the model outputs for individual examples change as the feature of interest is varied. For feature of interest j, the ICE plot for test example i is given by plotting the function

$$\begin{aligned} \text {ICE}^{(i)} (x_j; j) = f(\textbf{x}_{-j}^{(i)}, x_j). \end{aligned}$$
(4)

The PDP curve is simply the average of ICE curves over the test set, as can be trivially observed from Eqs. 3 and 4.

An individual ICE curve shows how the model output for a single example would vary if we were to alter the value of the j-th feature. By plotting the ICE curve over the range of values the j-th feature takes in the data, we observe how the value of the j-th feature affects the model’s prediction for the given example. In Fig. 1, we see that while the PDP curve shows the average behaviour, individual ICE curves show that the effect on the model output of changing a single feature can vary significantly depending on the values of the other features.
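Continuing the sketch above, the ICE curves of Eq. (4) are obtained by simply omitting the average over test examples; averaging the returned curves over rows recovers the PDP of Eq. (3). As before, the `model.predict` interface is an assumption.

```python
import numpy as np

def ice_curves(model, X, j, grid):
    """Empirical ICE curves (Eq. 4): one curve per test example, giving the
    model output as feature j is swept over `grid`."""
    curves = np.empty((X.shape[0], len(grid)))
    for k, x_j in enumerate(grid):
        X_fixed = X.copy()
        X_fixed[:, j] = x_j
        curves[:, k] = model.predict(X_fixed)
    return curves                      # the PDP is curves.mean(axis=0)
```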

Fig. 1

Example of PDP and ICE plots. The blue line shows the PDP. The ICE curves (gray) show how the output of the model changes for individual examples as the feature of interest changes

3 Explaining likelihood and uncertainty

In this section, we propose modifications to the feature importance methods described in Sects. 2.3.1 and 2.3.2 to capture explanations for the likelihood and uncertainty of the predictive distribution of uncertainty-aware models.

3.1 Likelihood-PFI

We begin with PFI for negative log-likelihood, which we refer to as Likelihood-PFI: this is a natural extension of the original PFI measure, with the only difference being that we exchange the loss function used in the traditional PFI setting with the negative log-likelihood of the target given the feature variables. Rather than defining a model output f, we now think of the model as giving a predictive distribution q and use the negative log-likelihood of the target given this distribution as our loss function.

Given a test set \(\{(\textbf{x}^{(1)}, y^{(1)}), \ldots , (\textbf{x}^{(n)}, y^{(n)})\}\) of size n, the empirical Likelihood-PFI is given by

$$\begin{aligned} \widehat{\text {PFI}}_\mathcal {L}(j) = \frac{1}{n} \sum _{i=1}^n \log q(y^{(i)} | X=\textbf{x}^{(i)}) - \log q(y^{(i)} | {X_{-j}}=\textbf{x}_{-j}^{(i)}, \widetilde{X}_j=\widetilde{x}^{(i)}_j ). \end{aligned}$$
(5)

In order to obtain the exact value of which this is an approximation, we write

$$\begin{aligned} \text {PFI}_\mathcal {L}(j) = \mathbb {E}_{X, Y} [ \log q(Y | X)] - \mathbb {E}_{\widetilde{X}_j, {X_{-j}}, Y}[\log q(Y|{X_{-j}}, X_j=\widetilde{X}_j)], \end{aligned}$$
(6)

where the \(X_j=\widetilde{X}_j\) in \(q(Y|{X_{-j}}, X_j=\widetilde{X}_j)\) indicates that the model still conditions on the j-th feature as usual, but that, when taking the expectation, the j-th feature is treated as being distributed according to its marginal distribution (independently of the other features and the target) rather than according to the joint distribution. This is clarified further in Appendix A.

Note that the order of terms in Eqs. 5 and 6 is reversed in comparison to the terms in the original PFI definitions (Eqs. 1 and 2). This is because the function under consideration here is the negative log-likelihood, and reversing the order of terms allows us to avoid a double negative. However, the quantity of interest is still the loss given the permuted feature minus the loss given the original feature value.

For the regression case, the Likelihood-PFI is related to the PFI for deterministic models; for instance, given a fixed-width Gaussian predictive distribution, the Likelihood-PFI reduces to the PFI of the squared loss between the predictive mean and the observed value of the target variable (up to a constant scaling factor). For the classification case, PFI is typically performed on either the misclassification rate (Breiman 2001) or the AUC (Molnar 2022), whereas our approach reduces to performing PFI on the log-loss. Unlike using the misclassification rate, this approach allows for the detection of small changes in the model’s confidence that are not large enough to change the class prediction, and unlike AUC it generalises naturally from binary classification to the multi-class setting. Like traditional PFI, we expect Likelihood-PFI to be non-negative in expectation: permuting an informative feature removes information and should increase the loss, whilst permuting uninformative features should have no effect.
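As a sketch of how the empirical Likelihood-PFI of Eq. (5) might be computed for a probabilistic classifier, the code below assumes a scikit-learn-style `predict_proba`/`classes_` interface; any model that can return \(\log q(y \mid \mathbf{x})\) could be substituted.

```python
import numpy as np

def likelihood_pfi(model, X, y, j, rng=None, eps=1e-12):
    """Empirical Likelihood-PFI (Eq. 5): mean log-likelihood of the observed
    labels on the original test set minus that on the permuted test set."""
    rng = np.random.default_rng(rng)
    class_index = {c: k for k, c in enumerate(model.classes_)}  # assumed sklearn-style attribute
    idx = np.array([class_index[label] for label in y])

    def mean_log_lik(X_eval):
        proba = model.predict_proba(X_eval)                     # q(y | x) for every class
        return np.mean(np.log(proba[np.arange(len(y)), idx] + eps))

    X_perm = X.copy()
    X_perm[:, j] = rng.permutation(X_perm[:, j])
    return mean_log_lik(X) - mean_log_lik(X_perm)
```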

3.2 Entropy-PFI

Likelihood-PFI gives a measure of how the model performance is affected by re-sampling a feature from its marginal distribution. However, it is also useful to know how the uncertainty of a model is affected by the value of each feature. To this end, we next look at how PFI can be performed for the entropy of its prediction rather than for its accuracy.

The Shannon entropy for a given categorical random variable Y taking values in \(\mathcal {Y}=\{1, \ldots , k\}\), given a model distribution q, is given by

$$\begin{aligned} {\mathcal {H}_q}(Y) = - \sum _{y \in \mathcal {Y}} q(y) \log q(y), \end{aligned}$$

where the q subscript in \({\mathcal {H}_q}\) emphasises that we consider the entropy under the model’s predictive distribution (in contrast with the entropy of the true predictive posterior). Similarly, for a continuous random variable, the entropy is written as

$$\begin{aligned} {\mathcal {H}_q}(Y) = - \int q(y) \log q(y) \, dy. \end{aligned}$$

With this, we define the entropy permutation feature importance (Entropy-PFI) as

$$\begin{aligned} \text {PFI}_\mathcal {H}(j)&= \mathbb {E}_{X, \widetilde{X}_j}\left[ {\mathcal {H}_q}(Y|{X_{-j}}, X_j=\widetilde{X}_j) - {\mathcal {H}_q}(Y| X)\right] , \end{aligned}$$

and we get the Monte Carlo approximation using a test set as

$$\begin{aligned} \widehat{\text {PFI}}_\mathcal {H}(j)&= \frac{1}{n} \sum _{i=1}^n {\mathcal {H}_q}(Y|{X_{-j}}=\textbf{x}_{-j}^{(i)}, X_j=\widetilde{x}_j^{(i)}) - {\mathcal {H}_q}(Y| X=\textbf{x}^{(i)}). \end{aligned}$$

We note that in the continuous case, the entropy may be negative as well as positive. For the regression models we consider, the predictive distribution will be Gaussian, and the entropy of that distribution may be written as

$$\begin{aligned} \mathcal {H}_q(y) = \frac{1}{2} + \frac{1}{2} \log (2 \pi \sigma ^2), \end{aligned}$$

where \(\sigma ^2\) is the variance of the distribution. This value is negative when \(\sigma ^2\) is sufficiently small.

We can think of Entropy-PFI as measuring how much uncertainty increases on average when we replace the j-th feature of an example with a random sample from its marginal distribution. Intuitively, we would expect this value to be non-negative: by replacing this feature, we will often be moving away from dense regions of the sample space, where the model has low epistemic uncertainty, towards sparser ones, where the model has seen fewer examples and therefore exhibits lower confidence in its predictions. A high Entropy-PFI score therefore means that the value of a feature helps to increase the confidence of the model in its prediction. Note that, unlike the Likelihood-PFI, the Entropy-PFI is independent of the ground-truth label.
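The sketch below gives one way to estimate the empirical Entropy-PFI. The entropy helpers cover a categorical predictive distribution (e.g., from a classifier's `predict_proba`) and a Gaussian predictive distribution parameterised by a standard deviation (e.g., a Gaussian process queried with `return_std=True`); both interfaces are assumptions used only for illustration.

```python
import numpy as np

def categorical_entropy(proba, eps=1e-12):
    """Shannon entropy of each row of a matrix of class probabilities."""
    return -np.sum(proba * np.log(proba + eps), axis=1)

def gaussian_entropy(std):
    """Differential entropy of a Gaussian predictive distribution,
    1/2 + (1/2) log(2*pi*sigma^2)."""
    return 0.5 + 0.5 * np.log(2.0 * np.pi * std ** 2)

def entropy_pfi(predict_entropy, X, j, rng=None):
    """Empirical Entropy-PFI: mean predictive entropy with feature j permuted
    minus mean predictive entropy on the original test set.
    `predict_entropy(X)` must return one entropy value per row of X."""
    rng = np.random.default_rng(rng)
    X_perm = X.copy()
    X_perm[:, j] = rng.permutation(X_perm[:, j])
    return np.mean(predict_entropy(X_perm)) - np.mean(predict_entropy(X))

# Hypothetical wrappers for the two model types:
# clf_entropy = lambda X: categorical_entropy(clf.predict_proba(X))
# gp_entropy = lambda X: gaussian_entropy(gp.predict(X, return_std=True)[1])
```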

3.3 Properties of entropy-PFI

In interpreting Entropy-PFI, it is important to understand exactly what the quantity measures. For a given feature, it gives a measure of how much the value of that feature supports the model’s conclusion derived from the other features. If the feature shares task-relevant information with other features, the model will be more confident when the feature under consideration agrees with those features and less confident when the relationship between the feature values is destroyed. Entropy-PFI establishes the difference between these two levels of confidence by comparing the level of confidence under the true distribution against the level of confidence when the feature of interest follows the same marginal distribution, but is independent of the other features.

In Fig. 2, we see a mock-up of how permuting features affects the entropy of a test set by moving examples from low-entropy regions to high-entropy ones. In the left panel, the contour plot shows the values of the entropy over the feature space, and each dot represents a member of the test set. The test examples all occur in low-entropy areas: the combined information of the two features allows the model to be certain in its prediction. In the centre panel, the second feature has been permuted. Here, some test points now lie in high-entropy areas: we can think of this as the model being surprised by the combination of features, and increasing its level of uncertainty as a result. On the right, histograms of the entropies show that, after permutation, there are more high-entropy points and the average entropy has increased.

Fig. 2

Visualisation of effects of PFI. The colour of each point shows the cluster to which the original (unpermuted) test example belonged. In the left panel, the original test set is shown, along with the contour lines for the entropy of a (hypothetical) model’s predictive distribution. In the centre panel, the test set is shown after permuting feature 2. In the right panel, histograms of the entropy before and after permuting the second feature are shown

Intuitively, we can think of Entropy-PFI as answering the question “How much does the true value of this feature support the given prediction on the basis of the evidence provided by the other features?”. A consequence of this interpretation is that we would expect that if a feature did not share any information with other features, the Entropy-PFI would be zero. In the following proposition, we verify that this is the case.

Proposition 1

If \({X_{-j}}\) is independent of \(X_j\), then the Entropy-PFI is zero.

Proof

Starting from the definition of \(\text {PFI}_\mathcal {H}\), we use the fact that \(X_j\) is independent of \({X_{-j}}\) to swap it with \(\widetilde{X}_j\):

$$\begin{aligned} \text {PFI}_\mathcal {H}(j)&= \mathbb {E}_X \left[ \mathbb {E}_{\widetilde{X}_j} [{\mathcal {H}_q}(Y|{X_{-j}}, X_j=\widetilde{X}_j)] - {\mathcal {H}_q}(Y|X) \right] \\&= \mathbb {E}_X \left[ \mathbb {E}_{X_j} [{\mathcal {H}_q}(Y|{X_{-j}}, X_j)] - {\mathcal {H}_q}(Y|X) \right] \\&=\mathbb {E}_X \left[ {\mathcal {H}_q}(Y|X) - {\mathcal {H}_q}(Y|X) \right] =0 . \end{aligned}$$

\(\square\)

We note that this is not the case for the Likelihood-PFI, or the PFI of any target-dependent measure in general, where dependencies between \(X_j\) and Y prevent the substitution used in the proof above. Intuitively, the result is caused by the fact that if a feature is independent, resampling that feature (e.g., by swapping its value with another from the same test set) will generate another sample from the same underlying data distribution. The above proposition means that when a feature is independent of the others, it does not globally affect the confidence of the model. However, this does not mean that the feature does not affect the model’s confidence locally, only that local effects cancel out in aggregate.

A feature being independent of the complementary set is not the only way that Entropy-PFI can be zero. It can also be zero if the predictive distribution does not depend on the feature of interest. This is analogous to how traditional PFI will be zero if a feature is not used in determining the model output.

Proposition 2

If the predictive distribution is not dependent on feature j, i.e., \(q(Y|X) = q(Y|{X_{-j}})\), then \(\text {PFI}_\mathcal {H}(j)=0\) and \(\text {PFI}_\mathcal {L}(j)=0\).

Proof

By assumption, and by definition of \({\mathcal {H}_q}\), we have that \({\mathcal {H}_q}(Y|{X_{-j}}, X_j) = {\mathcal {H}_q}(Y|{X_{-j}})\). By definition of \(\text {PFI}_\mathcal {H}\), we therefore have

$$\begin{aligned} \text {PFI}_\mathcal {H}(j)&= \mathbb {E}_X \left[ \mathbb {E}_{\widetilde{X}_j} [{\mathcal {H}_q}(Y|{X_{-j}}, X_j=\widetilde{X}_j)] - {\mathcal {H}_q}(Y|X) \right] \\&= \mathbb {E}_X \left[ {\mathcal {H}_q}(Y|{X_{-j}}) - {\mathcal {H}_q}(Y|{X_{-j}}) \right] =0 . \end{aligned}$$

Similar reasoning gives the result for \(\text {PFI}_\mathcal {L}(j)\):

$$\begin{aligned} \text {PFI}_\mathcal {L}(j)&= \mathbb {E}_{X,Y} \left[ \log q(Y|X) - \mathbb {E}_{\widetilde{X}_j} [\log q(Y|{X_{-j}}, X_j=\widetilde{X}_j)] \right] \\&= \mathbb {E}_{X, Y} \left[ \log q(Y|{X_{-j}}) - \log q(Y|{X_{-j}}) \right] =0 . \end{aligned}$$

\(\square\)

It is worth noting that both of the above propositions apply to the exact quantities computed over the data distribution. In practice, the empirical Entropy-PFI will be non-zero, but its typical deviation from zero will decrease with test set size.

3.4 Why conditional PFI is not useful in the context of entropy

At their core, permutation-based feature importance methods rely on re-sampling features based on their marginal distribution, breaking the relationship between the feature and the set of all other variables. However, doing this leads to undesirable outcomes: by ignoring correlations and other relationships between feature variables, we can find ourselves evaluating the model on points outside the true data distribution. As discussed in depth by Hooker et al. (2021), this can be problematic in that the resulting feature importances rely on the extrapolating behaviour of the model, giving importance measures that are dependent on the model’s behaviour in regions far from the training data, where the model’s behaviour is unlikely to reflect the true distribution of the data.

One of the proposed methods for dealing with this is through conditional approaches, where a variable \(\widetilde{X}^c_j\) is constructed so that it retains the same conditional relationship with \({X_{-j}}\) as \(X_j\), but is independent of Y given the information contained in \({X_{-j}}\). However, such approaches are at odds with what Entropy-PFI measures: Entropy-PFI is non-zero only when features share task-relevant information and permuting the feature of interest breaks the dependency between that feature and the set of other features.

Say that we have access to a random variable \(\widetilde{X}^c_j\) such that \((\widetilde{X}^c_j, {X_{-j}})\) has the same joint distribution as \((X_j, {X_{-j}})\) but \(\widetilde{X}^c_j\) is independent of Y (either completely or when conditioned on \({X_{-j}}\)). We can define Likelihood-PFI in the same way that conventional PFI is defined for this approach. However, if we attempt to define conditional Entropy-PFI in the same way, e.g.,

$$\begin{aligned} \text {CPFI}_\mathcal {H}(j) = \mathbb {E}_{X, \widetilde{X}^c_j}[{\mathcal {H}_q}(Y|{X_{-j}}, X_j=\widetilde{X}^c_j) - {\mathcal {H}_q}(Y|X)], \end{aligned}$$
(7)

we cannot use it to get estimates of the uncertainty caused by feature \(X_j\), since this will always be zero, as shown in the following proposition.

Proposition 3

The entropy version of conditional PFI, as defined in Eq. (7), is zero for all features.

Proof

Considering an arbitrary feature indexed by j, we look at the first term in more detail, finding

$$\begin{aligned} \mathbb {E}_{X, \widetilde{X}^c_j}[{\mathcal {H}_q}(Y|{X_{-j}}, X_j=\widetilde{X}^c_j)] = -\mathbb {E}_{X, \widetilde{X}^c_j}\left[ \int u(y|{X_{-j}}, X_j=\widetilde{X}^c_j) \, dy\right] , \end{aligned}$$

where \(u(\cdot ) = q(\cdot )\log q(\cdot )\) for the sake of brevity. Using the fact that \((\widetilde{X}^c_j, {X_{-j}})\) has the same joint distribution as \((X_j, {X_{-j}})\) by definition, we can expand the expectations to give

$$\begin{aligned} \mathbb {E}_{X, \widetilde{X}^c_j}&[{\mathcal {H}_q}(Y|{X_{-j}}, X_j=\widetilde{X}^c_j)]\\ =&-\int \int \int u(y| {X_{-j}}=\textbf{x}_{-j}, X_j=\widetilde{x}_j) P_{{X_{-j}}, \widetilde{X}^c_j}(\textbf{x}_{-j}, \widetilde{x}_j) \,dy \, d\textbf{x}_{-j} \, d\widetilde{x}_j \\ =&-\int \int \int u(y| {X_{-j}}=\textbf{x}_{-j}, X_j=\widetilde{x}_j) P_{{X_{-j}}, X_j}(\textbf{x}_{-j}, \widetilde{x}_j) \,dy \, d\textbf{x}_{-j} \, d\widetilde{x}_j\\ =&-\int \int u(y| X=\textbf{x}) \,dy\, P_{X}(\textbf{x}) \, d\textbf{x}\\ =&\mathbb {E}_{X} \left[ {\mathcal {H}_q}(Y|X) \right] . \end{aligned}$$

Plugging this into Eq. (7) gives that the conditional Entropy-PFI is zero. \(\square\)

Therefore, we see that, by definition, conditional approaches for PFI are ineffective in measuring the importance of features in determining entropy. However, we make two arguments as to why this is not problematic. Firstly, one of the ways in which Entropy-PFI has utility is in identifying when shared information between features has the effect of boosting model confidence; in attempting to preserve shared information between the feature of interest and the other features via conditioning during permutation, we eliminate the very discrepancy that we aim to measure. Secondly, while relying on the extrapolation behaviour of a model is in general undesirable, especially when the underlying data generating process is of interest, for an uncertainty-aware model, its ability to display increased (epistemic) uncertainty when extrapolating is one of the key desirable characteristics of the model.

3.5 How to interpret entropy-PFI

Given that Entropy-PFI becomes zero when a feature is independent of the others or when we construct conditional variants of it, it is clear that its behaviour is not exactly analogous to that of traditional PFI. In contrast, Likelihood-PFI can be thought of as a direct port of the traditional loss-based PFI to an uncertainty-aware setting, inheriting the known properties and issues from PFI itself.

As such, it is worth spending some time discussing exactly how Entropy-PFI should be interpreted and how it can be used to inform us about our model before we attempt to use the measure in practice. In this section, we discuss how Entropy-PFI can be used in conjunction with Likelihood-PFI to derive insights from the data, as well as overcome some of Likelihood-PFI’s shortcomings.

When entropy-PFI is zero

In Propositions 1 and 2, we gave two settings in which the Entropy-PFI is zero: one in which the feature is informative to the model but independent of other features, and the other in which the feature does not contain any information used in determining the predictive distribution. These are two cases that it is important to differentiate between, so it is reasonable to ask whether this means Entropy-PFI is too weak a tool to truly explain model behaviour.

Though Entropy-PFI in isolation does not allow us to discern the difference between the two scenarios, in both these cases, Entropy-PFI still serves as a useful complement to Likelihood-PFI to explain the role of the given feature in the model’s prediction. If the Likelihood-PFI is also zero, this suggests that the feature is uninformative, as changing its value does not have an impact on model performance. On the other hand, if Likelihood-PFI is positive, it means that the feature is informative, but the information it contains about the target variable is not shared with other features.

When entropy-PFI is non-zero

In the case where the Entropy-PFI is non-zero, this means that (a) the feature contains information about the target variables and (b) some (or all) of that information is also contained jointly in the other features. From Molnar (2022), we know that if information from a feature is available from other sources, the importance of that feature as measured by PFI will be diminished. As such, when we see a high Entropy-PFI, we should expect that the Likelihood-PFI under-reports the utility of the information contained in that variable, since the model will sometimes be extracting that information from a different source.

The role of extrapolation

Entropy-PFI measures the difference of a quantity in expectation over the marginal and the joint distributions of the feature of interest (and in expectation over the joint distribution of all the other features). For this to be non-zero, these distributions need to be different, and the marginal distribution will necessarily put more weight on some areas which are low/zero density in the joint distribution. For this reason, the measure can be thought of as inherently dependent on how a model determines its confidence levels in these regions. In an extreme case where the model gives the same uncertainty estimate everywhere, the Entropy-PFI will be zero for all features, and Likelihood-PFI reduces to PFI.

However, uncertainty-aware models are, in general, designed to give higher uncertainty estimates in regions where they have been presented with little data. This scenario is modelled in Fig. 2, with the increase in uncertainty being caused by the model being less confident in out-of-distribution regions. Under this assumption, we can think of the uncertainty as being largely epistemic in nature: where training data is sparser, uncertainty-aware models will exhibit higher uncertainty. Additionally, it is under this assumption that we expect Entropy-PFI to be non-negative: a model will likely be more confident when interpolating than when extrapolating. However, we also note that behaviour will vary even between uncertainty-aware models, depending on how each extrapolates, and this variation is an additional factor to consider when interpreting model behaviour with regard to uncertainty. While these tools are model-agnostic, they require practitioners to be aware of a model’s behaviour in order to interpret the results correctly.

What kind of uncertainty do we measure?

The mechanism by which Entropy-PFI works can be most easily understood in terms of epistemic uncertainty, but this does not mean that Entropy-PFI measures only epistemic uncertainty. Entropy-PFI explicitly measures the uncertainty of the predictive distribution, regardless of whether that uncertainty is epistemic or aleatoric in nature. Given the role of extrapolation in the calculation of Entropy-PFI, epistemic uncertainty will likely be a significant component, but the model’s understanding of the aleatoric uncertainty in different regions of the feature space will also affect the importance of features.

Moreover, Entropy-PFI deals with the predictive uncertainty of the model over the whole feature space. While thinking in terms of epistemic and aleatoric uncertainty is useful in reasoning about a model’s behaviour and interpreting the behaviour of our measures, it is important to remember that a model’s understanding of these sources of uncertainty will likely be flawed or incomplete. In the notable case where the model has a built-in assumption of homoscedastic noise, Entropy-PFI will indeed be measuring the model’s understanding of its uncertainty due to lack of data. In general, however, the model’s understanding of how aleatoric uncertainty changes over the joint feature distribution will also play a role. As such, while we can use our understanding of how models deal with epistemic and aleatoric uncertainty to interpret the results, we should not rely on our measures as a tool to distinguish between these two sources of uncertainty. Instead, our measures should be understood as operating on the uncertainty of the overall predictive distribution.

3.6 Examples of joint usage of entropy-PFI and likelihood-PFI

We now present two examples of synthetic datasets in which Entropy-PFI and Likelihood-PFI together can enhance interpretability in terms of feature importance and predictive uncertainty. In particular, one classification and one regression example are presented.

Classification experiment on synthetic data

In order to examine the interpretation of Entropy-PFI and Likelihood-PFI in a classification setting, we consider a toy binary classification dataset simulated according to the model

$$\begin{aligned} P(Y=1 | \textbf{x}) = \epsilon + (1-2\epsilon )\,\mathbbm {1}\left( \sum _{j=1}^J \textbf{x}_{j} > \frac{J}{2}\right) , \end{aligned}$$

where \(\textbf{x}\) is sampled uniformly from the unit hypercube \([0,1]^d\) (Mease and Wyner 2008). Here, \(\epsilon\) is the amount of label noise, d is the total number of features, and J is the number of relevant features. For our experiments, we set \(d=10\), \(J=4\) and \(\epsilon =0.1\). This means that there are four features all with equal importance (features 1, 2, 3 and 4) and six features which are irrelevant in determining the target class. We consider three versions of this data: one that is exactly as described above, a second in which feature 10 is replaced with a copy of feature 1, and a third in which feature 10 is replaced with a copy of feature 5.
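The exact sampling code is not given in the text; the sketch below is one way to simulate the three variants under the stated model, with the seeding and vectorisation being illustrative assumptions.

```python
import numpy as np

def simulate_classification(n, d=10, J=4, eps=0.1, copy_from=None, rng=None):
    """Simulate the binary classification data described above:
    P(Y=1|x) = eps + (1 - 2*eps) * 1{sum of the first J features > J/2},
    with x uniform on [0, 1]^d. If `copy_from` is given (0-indexed), the
    last feature is replaced with a copy of that feature."""
    rng = np.random.default_rng(rng)
    X = rng.uniform(0.0, 1.0, size=(n, d))
    if copy_from is not None:
        X[:, d - 1] = X[:, copy_from]
    p1 = eps + (1.0 - 2.0 * eps) * (X[:, :J].sum(axis=1) > J / 2.0)
    y = rng.binomial(1, p1)
    return X, y

# Three variants used in the experiment:
# X, y = simulate_classification(5000)                 # original data
# X, y = simulate_classification(5000, copy_from=0)    # feature 10 = copy of feature 1
# X, y = simulate_classification(5000, copy_from=4)    # feature 10 = copy of feature 5
```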

For each of the three generation methods discussed above, we generate a dataset of 5000 examples with a train/test split of 3750/1250. A separate calibrated random forest is trained on the training set for each of the three datasets, and Likelihood-PFI and Entropy-PFI are then computed on the corresponding test sets. In Fig. 3, we see how the Likelihood-PFI and Entropy-PFI are affected by adding redundancy to the dataset for a random forest with calibration (see Appendix B for details).

In the original dataset, all four task-relevant features have similarly high Likelihood-PFI values. However, when feature 10 is replaced with a copy of feature 1, the first feature is now ranked as less important; this is because there is now an alternate source from which the model can get the same information.

Fig. 3

Comparison of Likelihood-PFI and Entropy-PFI for three datasets, the second and third of which contain redundant features. When feature 10 is a copy of feature 1 (an informative feature), the Likelihood-PFI of feature 1 drops and its Entropy-PFI increases, while both the Likelihood-PFI and Entropy-PFI of feature 10 increase. When feature 10 is a copy of feature 5 (an uninformative feature), there is no effect on the Likelihood-PFI of either feature, and a small increase in Entropy-PFI for both

In contrast, features 1 and 10 are the features that Entropy-PFI identifies as most important in making the model confident in its output, as shown in the right panel of the figure. This makes sense: in all the training data, these features have been strongly correlated (identical, in fact), and therefore examples where this relationship is broken should be treated as out-of-distribution, which should be reflected in greater predictive uncertainty.

We also see a small increase in Entropy-PFI for the redundant features in the third dataset. This is likely due to the fact that, despite features 5 and 10 not containing any information about the target, spurious correlations in the training set may cause the model to use these features, and therefore the model is able to identify when it goes out-of-distribution due to disagreement between the two values, resulting in changes in entropy. However, we note that this effect is small in comparison to the effect in features that are informative and therefore are actually useful to the model. Indeed, as shown in Proposition 2, if the model (correctly) learns to disregard both features, the Entropy-PFI should be zero.

Regression example on synthetic data

A synthetic dataset is simulated from the regression model

$$\begin{aligned} Y = X_1 + X_2 + 0.9 X_3^2 + X_4 + X_5 + \varepsilon , \end{aligned}$$

where Y is a random variable dependent on feature variables \(X_1, \ldots , X_5\). We sample the features from the following Gaussian distributions:

$$\begin{aligned} (X_1, X_2),~(X_3, X_4) \sim \mathcal {N}\left( \begin{pmatrix} 0\\ 0 \end{pmatrix}, \begin{pmatrix} 1 & 0.8\\ 0.8 & 1 \end{pmatrix} \right) ,~ X_5\sim \mathcal {N}(0, 1),~ \varepsilon \sim \mathcal {N}(0, 2). \end{aligned}$$

Apart from the stated relationships, the features are otherwise all independent of each other. We train a Gaussian process regression model on 500 examples drawn from this distribution and generate an additional 500 test samples to be used for the importance measures. In Fig. 4, we show the Entropy-PFI and Likelihood-PFI, using the predictive distribution from the Gaussian process. Note that this predictive distribution includes uncertainty from the predicted noise, as well as the predicted distribution of the mean.
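A sketch of one way to simulate the data described above; the second parameter of each \(\mathcal{N}(0, \cdot)\) is treated as a variance here, and the sampling details are assumptions consistent with the stated distributions.

```python
import numpy as np

def simulate_regression(n, noise_var=2.0, rng=None):
    """Simulate the regression example: (X1, X2) and (X3, X4) are bivariate
    Gaussian with unit variances and covariance 0.8, X5 is standard normal,
    and Y = X1 + X2 + 0.9*X3**2 + X4 + X5 + noise."""
    rng = np.random.default_rng(rng)
    cov = np.array([[1.0, 0.8], [0.8, 1.0]])
    x12 = rng.multivariate_normal(np.zeros(2), cov, size=n)
    x34 = rng.multivariate_normal(np.zeros(2), cov, size=n)
    x5 = rng.normal(size=(n, 1))
    X = np.hstack([x12, x34, x5])
    y = (X[:, 0] + X[:, 1] + 0.9 * X[:, 2] ** 2 + X[:, 3] + X[:, 4]
         + rng.normal(scale=np.sqrt(noise_var), size=n))
    return X, y

# X_train, y_train = simulate_regression(500)
# X_test, y_test = simulate_regression(500)
```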

Fig. 4

Likelihood-PFI and Entropy-PFI for features in a synthetic dataset using a Gaussian process model. Since features 1–4 share information with each other, their Likelihood-PFI is reduced relative to the independent feature 5. In contrast, their shared information means that they have higher Entropy-PFI, whereas feature 5’s Entropy-PFI is negligible

In Fig. 4, we show the importance of each feature as measured by Likelihood-PFI and Entropy-PFI. For Likelihood-PFI, we observe many of the known properties of PFI under typical loss functions (Molnar 2022): having the same information shared between multiple features (e.g., having a large covariance between features \(X_1\) and \(X_2\)) diminishes their importance to the model relative to features that contain no shared information (e.g., feature \(X_5\)). In contrast, because the features \(X_1\) and \(X_2\) contain shared information, their Entropy-PFI is high: when the connection between these features is broken by permuting one of their values in the test set, the model is forced to extrapolate, resulting in higher predictive uncertainty. We see similar behaviour in the measures for \(X_3\) and \(X_4\), despite their contributions to Y being related in a more complex way. In both cases, the Entropy-PFI being high is suggestive of the fact that the Likelihood-PFI is likely lower than if the feature were independent of all others, as some of the information contained in that feature is also available from other sources.

Although we see that feature \(X_5\) is considered important in determining the negative log-likelihood (i.e., the model uses information from feature \(X_5\) in order to make an accurate prediction), it is not considered important in determining the uncertainty (i.e., on average, knowing feature \(X_5\) neither increases nor decreases the model’s confidence in its prediction); again, this is due to the feature being independent from the others.

Using this dataset, we are also able to demonstrate that the Entropy-PFI is not purely a function of epistemic uncertainty, but that the aleatoric uncertainty of the model also plays a role in determining its value. To this end, we train two more models with similar datasets, but this time changing the amount of noise in the target variable. In particular, we consider the cases where \(\varepsilon \sim \mathcal {N}(0, 0.5)\) and \(\varepsilon \sim \mathcal {N}(0, 1)\). We plot the results of all three experiments side-by-side in Fig. 5.

Fig. 5

Likelihood-PFI and Entropy-PFI for datasets varying the amount of noise in the target variable. The datasets are the same as in Fig. 4, but with the variance of \(\varepsilon\) set to the value \(\sigma ^2\) in each case. We see that when reducing the amount of noise, both the Likelihood-PFI and Entropy-PFI increase

With less noise in the target, we would expect the predictive distribution of a well-calibrated model to be packed more densely around the predictive mean; for this reason, it makes sense that permuting a feature leads to predictions that are punished more harshly by the loss, and that the Likelihood-PFI is therefore larger. Similarly, we observe that the increase in uncertainty when permuting the features is significantly greater.

4 PDP and ICE for entropy and likelihood

Entropy-PFI and Likelihood-PFI inherit the property from the original PFI of being global measures of importance. In the previous section, we saw that these two measures can complement each other to give useful insights into the global behaviour of a model’s predictive distribution. However, there are limitations to this approach. As we saw in Proposition 1, if a feature is independent of all others, the Entropy-PFI will be zero even if the uncertainty is greater for some values in the feature’s range than others. To alleviate this issue, and to get a more fine-grained explanation of the effects of features on uncertainty, we need an alternative approach.

4.1 Adapting PDP and ICE for entropy and likelihood

To tackle the issues discussed above, we examine how we may define partial dependence plots for entropy (Entropy-PDPs). We define these as follows:

$$\begin{aligned} \text {PDP}_\mathcal {H}(x; j) = \mathbb {E}_{{X_{-j}}} [{\mathcal {H}_q}(Y|{X_{-j}}, X_j=x)], \end{aligned}$$
(8)

and we approximate this value using a test set in the following way:

$$\begin{aligned} \widehat{\text {PDP}}_\mathcal {H}(x; j) = \frac{1}{n} \sum _{i=1}^n {\mathcal {H}_q}(Y|{X_{-j}}=\textbf{x}_{-j}^{(i)}, X_j=x). \end{aligned}$$
(9)

Entropy-PDP can be reasoned about in the same way as traditional PDP. Given the marginal distribution of \({X_{-j}}\), the Entropy-PDP at a point \(x\) tells us the amount of entropy that we would see in expectation over that distribution if we fixed the j-th feature’s value to x. We similarly define Entropy-ICE plots with

$$\begin{aligned} {\text {ICE}_\mathcal {H}}^{(i)}(x; j) = {\mathcal {H}_q}(Y|{X_{-j}}=\textbf{x}_{-j}^{(i)}, X_j=x). \end{aligned}$$
(10)

PDPs and ICE plots tend to be used on the model output itself. Entropy-PDPs and Entropy-ICEs can be seen as extensions of this idea, but using a statistic derived from the model output distribution, as opposed to using the model output directly.

Entropy-PDP shows how sensitive the uncertainty of a model is to changes in the value of the j-th feature. Regions where this value is lower can indicate that a feature is strongly informative to the model’s output, typically adding confidence to the model’s prediction. On the other hand, regions where this value is higher indicate increased uncertainty from the model. This can be due to some feature values being intrinsically linked to higher aleatoric uncertainty, or due to epistemic uncertainty caused by the model being forced to extrapolate for some test examples. These two cases can be distinguished by examining individual ICE curves; in the former case, the ICE curves will show an increase in the region for all test points, while in the latter, we would expect the uncertainty to be lower for curves where the original \(x_j\) value falls within that region.
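The sketch below computes the empirical Entropy-ICE curves of Eq. (10), from which the Entropy-PDP of Eq. (9) is obtained by averaging; the `predict_entropy` callable follows the same assumed interface as in the Entropy-PFI sketch earlier.

```python
import numpy as np

def entropy_ice(predict_entropy, X, j, grid):
    """Entropy-ICE curves (Eq. 10): predictive entropy per test example as
    feature j is swept over `grid`. The Entropy-PDP (Eq. 9) is the column mean."""
    curves = np.empty((X.shape[0], len(grid)))
    for k, x_j in enumerate(grid):
        X_fixed = X.copy()
        X_fixed[:, j] = x_j
        curves[:, k] = predict_entropy(X_fixed)
    return curves

# entropy_pdp = entropy_ice(gp_entropy, X_test, j, grid).mean(axis=0)
```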

We now further extend this notion, looking at the likelihood of the true label given the features. In this way we define the PDP for likelihood (Likelihood-PDP) as

$$\begin{aligned} \text {PDP}_\mathcal {L}(x; j) = -\mathbb {E}_{Y, {X_{-j}}}[\log q(Y| {X_{-j}}, X_j=x)]. \end{aligned}$$

Note that this is defined in terms of the negative log-likelihood. The PDP for likelihood is approximated on a test set via averaging:

$$\begin{aligned} \widehat{\text {PDP}}_\mathcal {L}(x; j) = -\frac{1}{n} \sum _{i=1}^n \log q(y^{(i)}| {X_{-j}}=\textbf{x}^{(i)}_{-j}, X_j=x). \end{aligned}$$

While Likelihood-PFI can be thought of as a relatively straightforward adaptation of PFI to an uncertainty-aware setting, Likelihood-PDP is more of a departure from the original PDP, in the sense that while the original PDP measures the value of the function itself, Likelihood-PDP measures the model’s performance on a test set and is explicitly target-dependent.

Again, we can extend this to ICE curves for the likelihood as follows:

$$\begin{aligned} {\text {ICE}_\mathcal {L}}^{(i)}(x; j) = - \log q(y^{(i)}| {X_{-j}}=\textbf{x}^{(i)}_{-j}, X_j=x). \end{aligned}$$

Compared to Entropy-PDP, the interpretation of Likelihood-PDP curves is relatively simple: higher values mean that the model finds it harder to make accurate confident predictions in a region, while lower values mean that it is easier for the model to predict accurately in a region. Likelihood-ICE curves have a similarly simple interpretation: a Likelihood-ICE curve shows how well the model is able to predict the target variable given \(\textbf{x}_{-j}^{(i)}\) and the given value for \(x_j\). Higher regions are regions where the value for \(x_j\) misleads the model, decreasing the likelihood of the target, while lower regions indicate that the given value of \(x_j\) makes the model more confident and accurate.
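Analogously, a sketch for the likelihood variants in a regression setting with a Gaussian predictive distribution; the `predict(X, return_std=True)` call matches scikit-learn's Gaussian process interface and is an assumption, and other models would supply \(\log q(y \mid \mathbf{x})\) directly.

```python
import numpy as np
from scipy.stats import norm

def likelihood_ice(model, X, y, j, grid):
    """Likelihood-ICE curves: negative log-likelihood of the observed target
    per test example as feature j is swept over `grid`. The Likelihood-PDP
    is the column mean."""
    curves = np.empty((X.shape[0], len(grid)))
    for k, x_j in enumerate(grid):
        X_fixed = X.copy()
        X_fixed[:, j] = x_j
        mean, std = model.predict(X_fixed, return_std=True)  # Gaussian predictive distribution
        curves[:, k] = -norm.logpdf(y, loc=mean, scale=std)
    return curves

# likelihood_pdp = likelihood_ice(gp, X_test, y_test, j, grid).mean(axis=0)
```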

4.2 A toy example for entropy-PDP and entropy-ICE

To offer an interpretation of the Entropy-PDP plots—and in particular to highlight the importance of also plotting Entropy-ICE plots—we consider a synthetic dataset, the distribution of (training) points for which is shown on the left of Fig. 6. We observe that the data is distributed around the border of the feature space, with no examples lying on the interior. This means that an uncertainty-aware model should exhibit high epistemic uncertainty when neither feature takes on an extreme value (i.e., \(-1.5< X_1, X_2 < 1.5\), as shown by the hatched area), and lower uncertainty when either feature takes on a more extreme value (i.e., either \(|X_1|>1.5\) or \(|X_2|>1.5\)). The target variable (not shown) is of the form \(Y= (X_1 + X_2)^2 + 0.1 \epsilon\), where \(\epsilon \sim \mathcal {N}(0, 1)\). Note that by symmetry of the features in the data-generating process, both features should appear equally important to any reasonable model and importance measure; therefore, little insight is to be gained from PFI or similar methods. However, we can still gain an understanding of our model’s uncertainty by looking at Entropy-PDP and Entropy-ICE.
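
The data-generating process described above is easy to reproduce. The sketch below shows one way to sample such a dataset and to obtain the predictive entropy of a Gaussian process fitted to it; the exact sampling of the "border" region and the GP configuration are assumptions made for illustration, not the precise setup used to produce Fig. 6.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)

def sample_border(n, inner=1.5, outer=3.0):
    # Keep only points where at least one feature is extreme, so the interior
    # |X1|, |X2| < inner is empty (an assumed sampling scheme).
    X = rng.uniform(-outer, outer, size=(3 * n, 2))
    keep = (np.abs(X[:, 0]) > inner) | (np.abs(X[:, 1]) > inner)
    return X[keep][:n]

X_train = sample_border(500)
y_train = (X_train[:, 0] + X_train[:, 1]) ** 2 + 0.1 * rng.standard_normal(len(X_train))

gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
gp.fit(X_train, y_train)

def gp_entropy(X):
    # Entropy of the Gaussian predictive distribution: 0.5 * log(2*pi*e*sigma^2).
    _, std = gp.predict(X, return_std=True)
    return 0.5 * np.log(2 * np.pi * np.e * std ** 2)
```

Passing `gp_entropy` to the `entropy_pdp` and `entropy_ice` helpers sketched earlier, with a grid of values for one feature, should qualitatively reproduce the behaviour shown in Figs. 6 and 7.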

Fig. 6 Visualisation of the distribution of the synthetic dataset (left) and entropy-PDP plots for each feature (right). The vertical dotted lines on the entropy-PDP plot show where the “interior” of the distribution begins

In the right-hand plot of Fig. 6, we see the Entropy-PDP plot for a Gaussian process model on a test set drawn from the same distribution as the training set. We observe high uncertainty both in the interior values and at very extreme values. We can hypothesise that the uncertainty in the interior is caused by examples where the feature not under consideration is also mid-range, causing the model input to be out-of-distribution and, therefore, for the model to exhibit high epistemic uncertainty. Similarly, at the most extreme values, the uncertainty may be higher, as there are fewer proximal training examples than in the centre of the bands.

This is a case where the PDP fails to capture the heterogeneity of the data: we see that uncertainty increases for interior values, but we have no information about whether this holds for all possible values of the other feature or only for some.

Fig. 7 Entropy-ICE plots for randomly sampled test points for each feature. The PDP curve is shown in black. All other lines are ICE plots, with the colour of the line showing the value of the constant feature (i.e., \(\textbf{x}_{-j}^{(i)}\)). The point on each line shows the value of the feature in the original example used to construct the ICE curve

The fact that the Entropy-PDP can be affected by extrapolation in this way is a reason to proceed with caution when analysing these plots. We therefore additionally look at Entropy-ICE plots (Fig. 7). In these plots, we can see two distinct behaviours: if we consider the Entropy-ICE plot for the first feature, for examples where feature \(X_2\) has small magnitude, we observe higher uncertainty when the feature under consideration (\(X_1\)) is also small. As hypothesised, this is due to the example being out-of-distribution and, therefore, having high associated uncertainty. On the other hand, when the complementary feature takes a more extreme value, the example generated will be in-distribution and, therefore, have lower associated uncertainty.

As an example of how Entropy-ICE curves may also be supplemented with additional information, Fig. 7 shows the values taken by the feature in the examples from which each ICE curve is constructed. This gives us further evidence of extrapolation in Entropy-PDP by demonstrating that none of the original test examples has entropy as high as the peak of the Entropy-PDP. We also observe that the behaviour of the ICE curves at extreme values is more uniform: for all test points, the model uniformly becomes more uncertain as the feature value approaches the edge of the distribution, in contrast with the heterogeneous behaviour for central values.

5 Experiments using real-world datasets

In this section, we examine how Entropy-PFI and Likelihood-PFI can be used in practice to gain insights into how various probabilistic models make their predictions. We consider a variety of models in both classification and regression settings.

5.1 Regression example: concrete dataset

In this example, we show how the proposed methods can be used to gain insight into the behaviour of models on a real-world regression dataset. Here, we demonstrate how Likelihood-PFI and Entropy-PFI give complementary explanations for the behaviour of uncertainty-aware regression models. We consider the UCI concrete dataset (Yeh 2007). We use two uncertainty-aware models: a Gaussian process with a radial basis function (RBF) kernel and a neural network using Monte Carlo (MC) dropout (Gal and Ghahramani 2016). Details of the configurations of both models can be found in Appendix B. We use 75% of the dataset for training and the remaining 25% for testing.
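
For reference, the permutation scheme underlying both importance measures can be sketched as follows. The precise definitions are those of Eqs. 5 and 6 in the text; the sketch below only illustrates the permutation mechanics, with the number of repeats and other details chosen for illustration. The `stat` callable is assumed to return the tracked quantity for a whole test set, e.g. the mean predictive entropy for Entropy-PFI, or the mean negative log-likelihood of the true targets (closed over by the callable) for Likelihood-PFI.

```python
import numpy as np

def permutation_feature_importance(stat, X_test, n_repeats=10, seed=0):
    """Generic permutation importance (a sketch; cf. Eqs. 5 and 6).

    stat: callable mapping an (n, d) input array to a scalar summary, e.g.
        the mean predictive entropy (Entropy-PFI) or the mean negative
        log-likelihood of the true targets (Likelihood-PFI); for the latter,
        the callable should close over the test targets.
    """
    rng = np.random.default_rng(seed)
    baseline = stat(X_test)
    importances = np.zeros(X_test.shape[1])
    for j in range(X_test.shape[1]):
        deltas = []
        for _ in range(n_repeats):
            X_perm = X_test.copy()
            # Permuting feature j breaks its association with the other features.
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            deltas.append(stat(X_perm) - baseline)
        importances[j] = np.mean(deltas)
    return importances
```

For the Gaussian process model, for example, `stat` could be the mean of the per-example predictive entropies for Entropy-PFI, or the mean negative log predictive density of the observed targets for Likelihood-PFI.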

In Fig. 8, we see the relative importance of each feature in terms of both Likelihood-PFI and Entropy-PFI. We observe that although age is the most important feature in terms of the likelihood, its (global) effect on the entropy is small. This suggests that age is important in accurately predicting the target variable (i.e., the value of the feature will often have a significant effect on the likelihood), but that it is not strongly related to any of the other features.

Fig. 8 Comparison of entropy-PFI and likelihood-PFI for neural networks with Monte Carlo dropout and Gaussian processes fitted to the UCI concrete dataset

We can verify this by examining how effectively we can train regression models to learn each feature in the dataset given the others. In Fig. 9, we show the coefficient of determination obtained when a random forest regression model is trained to predict each feature from all of the others (on the right, the target variable is also included as a predictor). We see that age does indeed appear to be independent of the other features: knowing all the other features does not provide reliable information about age. However, prediction of the age variable improves significantly when we also have access to the target variable, suggesting that the target and age share information.
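
This redundancy check amounts to treating each feature in turn as a regression target. A minimal sketch is given below; the use of cross-validated \(R^2\) and the random forest hyperparameters are illustrative assumptions rather than the exact configuration used to produce Fig. 9.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

def feature_redundancy_r2(X, y=None, n_splits=5):
    """For each feature, fit a random forest to predict it from the remaining
    features (optionally also the target) and report cross-validated R^2."""
    scores = []
    for j in range(X.shape[1]):
        inputs = np.delete(X, j, axis=1)
        if y is not None:
            # Include the target as an additional predictor (right panel of Fig. 9).
            inputs = np.column_stack([inputs, y])
        model = RandomForestRegressor(n_estimators=200, random_state=0)
        scores.append(
            cross_val_score(model, inputs, X[:, j], scoring="r2", cv=n_splits).mean()
        )
    return np.array(scores)
```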

Fig. 9 Comparison of coefficients of determination for models predicting one feature’s value given the others (UCI concrete dataset)

Note that in concluding that age is independent of the other features, we draw on several observations. The fact that the Likelihood-PFI is high for the feature discounts the possibility that the Entropy-PFI being zero is simply a result of the model discarding the feature and not making use of it in determining the predictive distribution. Furthermore, the fact that the Entropy-PFI is non-zero for other features means that the model is indeed uncertainty-aware, and is not just using the same distribution with shifted mean for each point.

However, as previously noted, just because the Entropy-PFI is small/zero, it does not mean that the entropy is not affected locally by the specific value of the feature. To better understand this, in Fig. 10 we plot the Entropy-PDP curve for age for both models, along with a few randomly chosen ICE curves. This figure highlights the fact that Entropy-PFI is a global property: despite Entropy-PFI being near zero, we see that entropy varies not only as we change the feature value (shown by how the PDP curve changes as the value for age does), but is also affected by the values that other features take (shown by the variation in characteristics of the ICE curves).

Fig. 10 Entropy-ICE and entropy-PDP plots for the age feature (UCI concrete dataset). The thicker blue curves are the PDP curves, with ICE curves for examples in the test set shown in grey. The original values for the feature for each ICE curve are shown as grey dots lying on each grey curve

5.2 Classification example: diabetes dataset

To demonstrate the utility of our approach for uncertainty-aware models in a classification setting, we examine the importance of various features for models trained on the UCI diabetes dataset (Smith et al. 1988). In particular, we examine two models: calibrated random forests and deep neural networks with weight uncertainty, also known as Bayes by backprop (BBB) (Blundell et al. 2015). Further details of the configurations of both models are given in Appendix B. The UCI diabetes dataset was also used by Breiman (2001) in one of the first applications of PFI to explain model behaviour. For this example, we first review the findings of Breiman (2001), before examining what additional insights can be gleaned from our new approach.

In Breiman (2001), it is observed that the second feature (plas) is the most important, followed by age (feature 8) and mass (feature 6). Through additional experiments, Breiman also showed that while feature 8 contains useful information about the target label, its predictive information is largely redundant given feature 2; hence, training a model with or without this feature has little effect on the model’s predictive power.

Fig. 11 Entropy-PFI and likelihood-PFI values for calibrated random forests and Bayes by backprop neural networks fitted to the UCI diabetes dataset. Ordered by likelihood-PFI

Where Breiman (2001) measured the percentage increase in misclassification rate, we measure the difference in likelihood as defined in Eqs. 5 and 6. Doing so, we observe in Fig. 11 that the same phenomena occur for the likelihood in our random forest model as occurred for classification error in Breiman’s: plas (feature 2) is the most important, with age (feature 8) and mass (feature 6) also having significant effects on the model’s predictive power. We observe similar results for an MLP using BBB.

In Fig. 12, we show the Entropy-PDP and Entropy-ICE plots for plas (feature 2). We see that for lower values of the feature, the entropy is relatively low for both models. We also observe a significant increase in uncertainty for higher values, peaking at about 150. Examining the distribution of the feature in the training set, separated by class, we see that at extreme values the feature is strongly informative of the class label, but for values in the region 100–170 both classes are present with high frequency. In this range the feature is therefore less informative, and the models are correspondingly less confident in predictions that rely on it.
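
For the binary classifiers used here, the predictive entropy plotted in these curves is simply the entropy of the predicted class distribution. A minimal helper, assuming a scikit-learn-style `predict_proba` (as exposed by a calibrated random forest), is shown below; it can be plugged into the `entropy_ice` sketch given earlier, with `j` set to the index of the plas column.

```python
import numpy as np

def binary_predictive_entropy(clf, X):
    # Entropy (in nats) of the classifier's predictive distribution over the
    # two classes, taken from predict_proba (e.g. a calibrated random forest).
    p = np.clip(clf.predict_proba(X)[:, 1], 1e-12, 1 - 1e-12)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))
```

Sweeping plas over a grid with this helper should yield curves like those in the left and middle panels of Fig. 12.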

We can also see the effect of this change in the confidence level of the models on the negative log-likelihood in Fig. 13. For examples in the positive class, the model becomes more confident in its correct prediction (given the other features) as the plas value increases, leading to a decrease in the loss for those examples. For the negative class, the opposite is true: the model becomes more confident in its prediction as the feature value decreases. Again, we can interpret this in terms of the confidence increasing as the feature of interest adds evidence to support the conclusion inferred from the other features.

Fig. 12 Entropy-PDP (orange) and entropy-ICE curves for the UCI diabetes dataset for plas (feature 2). Left: curves for a calibrated random forest. Middle: curves for Bayes by backprop. Right: distribution of the feature value in the training set for both classes. In the left and middle figures, the bars along the bottom show values of the feature for examples in the test set of the positive (red) and negative (blue) classes. The red and blue lines are ICE curves for test examples of the two classes, and the orange curves are the entropy-PDP curves

Fig. 13 Likelihood-PDP (blue, bold) and likelihood-ICE curves for the UCI diabetes dataset for plas (feature 2). Left: curves for a calibrated random forest. Middle: curves for Bayes by backprop. Right: distribution of the feature value in the training set for both classes

We observe a similar phenomenon for mass (feature 6) in Fig. 14. In particular, for members of the positive class, the ICE curves generally show greater uncertainty at low feature values. This may be because the training set contains no positive-class examples with low values of this feature, so the examples constructed for the ICE curves are out-of-distribution and exhibit high epistemic uncertainty. Note that this is not picked up by the Entropy-PDP and can only be observed using Entropy-ICE.

In Fig. 15, we again see a difference in behaviour of the Likelihood-ICE curves for the different classes, with the loss being lower for the positive class at higher values for the feature, and for the negative class at lower values.

Fig. 14 Entropy-PDP (orange) and entropy-ICE curves for the UCI diabetes dataset for mass (feature 6). Left: curves for a calibrated random forest. Middle: curves for Bayes by backprop. Right: distribution of the feature value in the training set for both positive and negative classes

Fig. 15 Likelihood-PDP (blue, bold) and likelihood-ICE curves for the UCI diabetes dataset for mass (feature 6). Left: curves for a calibrated random forest. Middle: curves for Bayes by backprop. Right: distribution of the feature value in the training set for both positive and negative classes

6 Conclusions

In this paper, we have proposed modifications of PFI, PDP and ICE that can be used to gain insights into the importance of features in uncertainty-aware models, both in terms of likelihood and uncertainty (as measured by the entropy of the predictive distribution).

Permutation feature importance, amongst other methods, has been criticised for forcing the model to extrapolate to unexplored regions of the feature space when constructing explanations. Although the suggested solution is to avoid PFI in favour of methods that explicitly address this issue, the simplicity of PFI and related approaches means that they nonetheless remain popular. Entropy-PFI can mitigate some of these issues by providing additional information about the level of uncertainty in a model that is attributable to a given feature. In particular, Entropy-PFI can be used to identify when a feature is likely to be independent of other informative features, in which case its feature importance can be trusted.

We note that Entropy-PFI does not completely resolve the issues raised by Hooker et al. (2021), and careful interpretation and an understanding of the strengths and weaknesses of each method are still required. However, given the continued popularity of permutation- and extrapolation-based importance methods, even in light of recent criticism, these additional tools help to mitigate some of their shortcomings.

Returning to the medical example we gave in Sect. 1, we now have the tools to analyse our hypothetical model. By examining PDP and ICE curves, we can determine if the given feature is the source of the uncertainty based on how the uncertainty changes as we vary its value. Additionally, we can use Entropy-PFI to determine if the uncertainty is due to disagreements between features.

Throughout, when demonstrating entropy-based feature importance methods, we found it useful in building intuition to make appeals to the readers’ understanding of aleatoric and epistemic uncertainty. However, we did not attempt to separate them. The utility of explaining these two sources of uncertainty, as well as the best methodology for doing so, remain questions for future research.

Similarly, there are many other methods that could be adapted to explain uncertainty, such as LIME (Ribeiro et al. 2016), as well as further variants of the methods used in this paper (for example, plotting the derivatives of ICE curves rather than the raw values).