1 Introduction

Only recently, a trend towards the objective empirical validation of XAI methods using ground truth data has been observed (Tjoa & Guan, 2020; Li et al., 2021; Zhou et al., 2022; Arras et al., 2022; Gevaert et al., 2022; Agarwal et al., 2022). These studies are, however, limited in the extent to which they permit a quantitative assessment of explanation performance, in the breadth of XAI methods evaluated, and in the difficulty of the posed ‘explanation’ problems. In particular, most published benchmark datasets are constructed in a way such that realistic correlations between class-dependent (e.g., the foreground or object of an image) and class-agnostic (e.g., the image background) features are excluded. In practice, such dependencies can give rise to features acting as suppressor variables. Briefly, suppressor variables have no statistical association to the prediction target on their own, yet including them may allow an ML model to remove unwanted signals (noise), which can lead to improved predictions. In the context of image or photography data, suppressor variables could be parts of the background that capture the general lighting conditions. A model can use such information to normalize the illumination of the object and, thereby, improve object detection. More details on the principles of suppressor variables can be found in Conger (1974); Friedman and Wall (2005); Haufe et al. (2014); Wilming et al. (2022). Here we adopt the formal requirement that an input feature should only be considered important if it has a statistical association with the prediction target, or is associated to it by construction. In that sense, it is undesirable to attribute importance to pure suppressor features.

Yet, Wilming et al. (2022) have shown that some of the most popular model-agnostic XAI methods are susceptible to the influence of suppressor variables, even in a linear setting. Using synthetic linearly separable data defining an explicit ground truth for XAI methods and linear models, Wilming et al. (2022) showed that a significant amount of feature importance is incorrectly attributed to suppressor variables. They proposed quantitative performance metrics for an objective validation of XAI methods, but limited their study to linearly separable problems and linear models. They demonstrated that methods based on so-called activation patterns (that is, univariate mappings from predictions to input features), based on the work of Haufe et al. (2014), provide the best explanations. Wilming et al. (2023) took this one step further and presented a minimal two-dimensional linear example, analytically showing that many popular XAI methods attribute arbitrarily high importance to suppressor variables. However, it is unclear to what extent these results transfer to various non-linear settings. In the context of the lighting condition example, this recent work showed that many popular XAI methods could highlight every pixel containing illumination information as important. If the illumination information is present across all pixels of the image, an explanation could appear to be composed mostly of random noise, presenting little value to a user. We therefore require that good XAI methods be able to distinguish between truly important features and suppressors, and ideally inform the user of which category the highlighted variables belong to. Alternatively, a good XAI method should be able to highlight only the truly informative features used by a model, and to mask suppressors and other such misleading features.

Thus, well-designed non-linear ground truth data comprising realistic correlations between important and unimportant features are needed to study the influence of suppressor variables on XAI explanations in non-trivial settings, which is the purpose of this paper. We go beyond existing work in the following ways:

First, we design one linear and three non-linear binary image classification problems, in which different types and combinations of tetrominoes (Golomb, 1996), overlaid on a noisy background, need to be distinguished. Tetrominoes are geometric shapes consisting of four blocks, popularized by the game Tetris (Nintendo of America, 1989). In all cases, ground truth explanations are explicitly known through the location of the tetrominoes. Apart from the linear case, these classification problems require (different types of) non-linear predictive models to be solved effectively.

Second, based on signal detection theory and optimal transport, we define three suitable quantitative metrics of ‘explanation performance’ designed to handle the case of few important features.

Third, using three different types of background noise (white, correlated, ImageNet), we invoke the presence of suppressor variables in a controlled manner and study their effect on explanation performance.

Fourth, we evaluate the explanation performance of no fewer than sixteen of the most popular model-agnostic and model-specific XAI methods, across three different machine learning architectures. We compare these to four model-agnostic baselines that can serve as null models for explanation performance.

In doing so, we provide the first comprehensive study going beyond linear data. We position the XAI-TRIS datasets and metrics as tools not only to benchmark current XAI methods, but also to guide the development of new XAI methods that overcome the susceptibility to false-positive attribution of importance to suppressor variables.

2 Methods

Our workflow of applying and benchmarking post-hoc XAI methods can be seen in Fig. 1. Given a classification dataset generated with an explicitly known ground truth controlling the class-conditional distribution, we train a machine learning model using the training (and validation) split of the data. Taking the trained model and test data (either as individual samples or a batch) as the inputs to the given XAI method, we compute output explanations of the same dimensionality as the input data, intended to reflect the importance of each pixel for the trained model’s prediction output. Finally, we apply novel performance metrics to compare produced explanations with the ground truth for the given sample, giving us the explanation performance of the method, with quantitative and qualitative results visualized in Sect. 4. This section covers each component of Fig. 1, following the experimental pipeline from data generation through to output analyses.

Fig. 1
figure 1

The process of evaluating an XAI method. XAI-TRIS classification datasets are generated through explicitly defined tetrominoes controlling the class-conditional distribution, which serve as the ground truth features for explanations. Given an ML model trained on the given data, the XAI method takes test data and the model as input, producing explanations. These explanations are passed to performance metrics, which use the given ground truth as a basis for comparison

2.1 Data generation

For each scenario, we construct an individual dataset of \(64 \times 64\)-sized images as \(\mathcal {D} = \{(\textbf{x}^{(n)},y^{(n)})\}_{n=1}^N\), consisting of i.i.d. observations \((\textbf{x}^{(n)} \in \mathbb {R}^D, y^{(n)} \in \{0,1\})_{n=1}^N\), where the number of features is \(D=64^2=4096\) and \(N=40,000\). Here, \(\textbf{x}^{(n)}\) and \(y^{(n)}\) are realizations of the random variables \(\textbf{X}\) and Y, with joint probability density function \(p_{\textbf{X},Y}(\textbf{x},y)\).

In each scenario, we generate a sample \(\textbf{x}^{(n)}\) as a combination of a signal pattern \(\varvec{a}^{(n)}\in \mathbb {R}^{D}\), carrying the set of truly important features used to form the ground truth for an ideal explanation, with some background noise \(\varvec{\eta }^{(n)} \in \mathbb {R}^{D}\). We follow two different generative models depending on whether the two components are combined additively or multiplicatively.

Additive generation process For additive scenarios, we define the data generation process

$$\begin{aligned} \textbf{x}^{(n)} = \alpha (R^{(n)} \circ (H \circ \varvec{a}^{(n)})) + (1-\alpha ) (G \circ \varvec{\eta }^{(n)}), \end{aligned}$$
(1)

for the n-th sample. Signal pattern \({\varvec{a}}^{(n)} = {\varvec{a}}(y^{(n)})\) carries differently shaped tetromino patterns depending on the binary class label \(y^{(n)} \sim \text {Bernoulli}(1/2)\). We apply a 2D Gaussian spatial smoothing filter \(H:\mathbb {R}^D \rightarrow \mathbb {R}^D\) to the signal component to smooth the integration of the pattern’s edges into the background, with smoothing parameter (spatial standard deviation of the Gaussian) \(\sigma _{\text {smooth}}=1.5\). The Gaussian filter H can technically provide infinite support to \(\varvec{a}^{(n)}\), so in practice we threshold the support at \(5\%\) of the maximum level. White Gaussian noise \(\varvec{\eta }^{(n)} \sim \mathcal {N}(\textbf{0}, \textbf{I}_D)\), representing a non-informative background, is sampled from a multivariate normal distribution with zero mean and identity covariance \(\textbf{I}_D\). For each classification problem, we define a second background scenario, denoted as CORR, in which we apply a separate 2D Gaussian spatial smoothing filter \(G:\mathbb {R}^D \rightarrow \mathbb {R}^D\) to the noise component \(\varvec{\eta }^{(n)}\). Here, we set the smoothing parameter to \(\sigma _{\text {smooth}}=10\). The third background type is that of samples from the ImageNet database (Deng et al., 2009), denoted IMAGENET. We scale and crop images to be \(64 \times 64\)-px in size, preserving the original aspect ratio. Each 3-channel RGB image is converted to a single-channel gray-scale image using the built-in Python Imaging Library (PIL) functions and is zero-centered by subtraction of the sample’s mean value.

As alluded to below, we also analyze a scenario where the signal pattern \(\varvec{a}^{(n)}\) undergoes a random spatial rigid body (translation and rotation) transformation \(R^{(n)}: \mathbb {R}^D \rightarrow \mathbb {R}^D\). All other scenarios make use of the identity transformation \(R^{(n)} \circ (H \circ \varvec{a}^{(n)}) = H \circ \varvec{a}^{(n)}\). Transformed signal and noise components \((R^{(n)} \circ (H \circ \varvec{a}^{(n)}))\) and \((G \circ \varvec{\eta }^{(n)})\) are horizontally concatenated into matrices \(\textbf{A} = \left[ (R^{(1)} \circ (H \circ \varvec{a}^{(1)})), \hdots , (R^{(N)} \circ (H \circ \varvec{a}^{(N)})) \right]\) and \(\textbf{E} = \left[ (G \circ \varvec{\eta }^{(1)}), \hdots , (G \circ \varvec{\eta }^{(N)}) \right]\). Signal and background components are then normalized by the Frobenius norms of \(\textbf{A}\) and \(\textbf{E}\): \(R^{(n)} \circ (H \circ \varvec{a}^{(n)}) \leftarrow {(R^{(n)} \circ (H \circ \varvec{a}^{(n)}))}/{||\textbf{A}||_{\text {F}}}\) and \((G \circ \varvec{\eta }^{(n)}) \leftarrow {(G \circ \varvec{\eta }^{(n)})}/{||\textbf{E}||_{\text {F}}}\), where the Frobenius norm of a matrix \(\textbf{A}\) is defined as \(||\textbf{A}||_{\text {F}} {:}{=}(\sum ^N_{n=1}\sum ^D_{d=1}(\varvec{a}^{(n)}_d)^2)^{1/2}\). Finally, a weighted sum of the signal and background components is calculated, where the scalar parameter \(\alpha \in [0,1]\) determines the signal-to-noise ratio (SNR).
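To make the additive generation process concrete, the following is a minimal NumPy/SciPy sketch for a single sample. It is illustrative only: function and variable names are ours, the Frobenius normalization is approximated per sample rather than over the stacked matrices \(\textbf{A}\) and \(\textbf{E}\), and the placeholder pattern is not a real tetromino.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def generate_additive_sample(a, eta, alpha, sigma_signal=1.5, sigma_noise=None, rigid_transform=None):
    """Sketch of Eq. (1) for a single 64x64 sample.

    a     : binary tetromino pattern a^(n);  eta : white Gaussian noise eta^(n)
    alpha : SNR parameter; sigma_noise=10 gives the CORR background, None gives WHITE.
    """
    # H: Gaussian smoothing of the signal, thresholded at 5% of the maximum level
    signal = gaussian_filter(a.astype(float), sigma=sigma_signal)
    signal[signal < 0.05 * signal.max()] = 0.0
    # R^(n): optional rigid-body transform (identity in LIN, MULT, and XOR)
    if rigid_transform is not None:
        signal = rigid_transform(signal)
    # G: optional smoothing of the background noise (CORR scenario)
    noise = gaussian_filter(eta, sigma=sigma_noise) if sigma_noise else eta
    # NOTE: the paper normalizes by the Frobenius norms of the stacked matrices A and E;
    # this per-sample normalization is a simplification for illustration.
    signal /= np.linalg.norm(signal)
    noise /= np.linalg.norm(noise)
    return alpha * signal + (1 - alpha) * noise

rng = np.random.default_rng(0)
pattern = np.zeros((64, 64)); pattern[8:16, 8:40] = 1.0   # illustrative placeholder shape
x = generate_additive_sample(pattern, rng.standard_normal((64, 64)), alpha=0.3, sigma_noise=10)
```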

Multiplicative generation process For multiplicative scenarios, we define the generation process

$$\begin{aligned} \textbf{x}^{(n)} = \left( \varvec{1} - \alpha \left( R^{(n)} \circ (H \circ \varvec{a}^{(n)}) \right) \right) \left( G \circ \varvec{\eta }^{(n)}\right) \;, \end{aligned}$$
(2)

where \(\varvec{a}^{(n)}\), \(\varvec{\eta }^{(n)}\), \(R^{(n)}\), H and G are defined as above, \(\textbf{A}\) and \(\textbf{E}\) are Frobenius-normalized, and \(\varvec{1} \in \mathbb {R}^D\).

For data generated via either process, we scale each sample \(\textbf{x}^{(n)} \in \mathbb {R}^D\) to the range \([-1,1]^D\), such that \(\textbf{x}^{(n)} \leftarrow \textbf{x}^{(n)} / \max |\textbf{x}|\), where \(\max |\textbf{x}|\) is the maximum absolute value of any feature across the dataset.

Emergence of suppressors Note that the correlated background noise scenario induces the presence of suppressor variables, both in the additive and the multiplicative data generation processes. A suppressor here would be a pixel that is not part of the foreground \(R^{(n)} \circ (H \circ \varvec{a}^{(n)})\), but whose activity is correlated with a pixel of the foreground by virtue of the smoothing operator G. Based on previously reported characteristics of suppressor variables (Conger, 1974; Friedman & Wall, 2005; Haufe et al., 2014; Wilming et al., 2022), we expect that XAI methods may be prone to attributing importance to suppressor features in the considered linear and non-linear settings, leading to drops in explanation performance as compared to the white noise background setting.

Scenarios

We make use of tetrominoes (Golomb, 1996), geometric shapes consisting of four blocks (each block here being \(8 \times 8\) pixels), to define each signal pattern \(\varvec{a}^{(n)} \in \mathbb {R}^{64 \times 64}\). We choose these as the basis for signal patterns because they provide a fixed and controllable number of features (pixels) per sample; we specifically use the ‘T’-shaped and ‘L’-shaped tetrominoes due to their four unique appearances under each 90-degree rotation. These induce statistical associations between features and target in four different binary classification problems:

Linear (LIN) and multiplicative (MULT)

For the linear case, we use the additive generation model Eq. (1), and for the multiplicative case, we instead use the multiplicative generation model Eq. (2). In both, signal patterns are defined as a ‘T’-shaped tetromino pattern \({\varvec{a}}^{\text {T}}\) near the top-left corner if \(y=0\) and an ‘L’-shaped tetromino pattern \({\varvec{a}}^{\text {L}}\) near the bottom-right corner if \(y=1\), which defines the binary classification problem. Each pattern is encoded such that \(a^{\text {T/L}}_{i,j} = 1\) for each pixel in the tetromino pattern, positioned at the i-th row and j-th column of \(\varvec{a}^{\text {T/L}}\), and zero otherwise.

Translations and rotations (RIGID)

In this scenario, the patterns \(\varvec{a}^{\text {T/L}}\) defining each class are no longer in fixed positions but are randomly translated and rotated by multiples of 90 degrees according to a rigid body transform \(R^{(n)}\), constrained such that the entire tetromino is contained within the image. In contrast to the other scenarios, we use a 4-pixel-thick tetromino here to enable a larger set of transformations and thus increase the complexity of the problem. This is an additive scenario, in accordance with Eq. (1).

XOR The final scenario is that of an additive XOR problem, where we use both tetromino variants \(a^{\text {T/L}}\) in every sample. Transformation \(R^{(n)}\) is, once again, the identity transform here. Class membership is defined such that members of the first class, where \(y=0\), combine both tetrominoes with the background of the image either positively or negatively, such that \({\varvec{a}}^{\text {XOR++}} = {\varvec{a}}^{\text {T}} + {\varvec{a}}^{\text {L}}\) and \({\varvec{a}}^{\text {XOR-{}-}} = - {\varvec{a}}^{\text {T}} - {\varvec{a}}^{\text {L}}\). Members of the opposing class, where \(y=1\), imprint one shape positively, and the other negatively, such that \({\varvec{a}}^{\text {XOR+-}} = {\varvec{a}}^{\text {T}} - {\varvec{a}}^{\text {L}}\) and \({\varvec{a}}^{\text {XOR-+}} = - {\varvec{a}}^{\text {T}} + {\varvec{a}}^{\text {L}}\). Each of the four XOR cases is represented equally frequently across the dataset.
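As an illustration of how the four XOR signal variants could be drawn with equal probability, the following sketch assumes binary masks a_T and a_L corresponding to \({\varvec{a}}^{\text {T}}\) and \({\varvec{a}}^{\text {L}}\); the function and argument names are ours.

```python
import numpy as np

def xor_signal_pattern(a_T, a_L, y, rng):
    """Illustrative sketch: draw one of the four XOR signal patterns for label y.

    a_T, a_L : binary masks of the 'T' and 'L' tetrominoes (same shape)
    y        : class label in {0, 1}
    """
    sign = rng.choice([1, -1])
    if y == 0:
        # class 0: both shapes imprinted with the same sign (XOR++ or XOR--)
        return sign * (a_T + a_L)
    # class 1: shapes imprinted with opposite signs (XOR+- or XOR-+)
    return sign * a_T - sign * a_L
```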

Figure 2 shows two examples from each class of each classification problem and for the three background types—Gaussian white noise (WHITE), smoothed Gaussian white noise (CORR), and ImageNet samples (IMAGENET). Figure 3 in the supplementary material shows examples of each of the 12 scenarios across four signal-to-noise ratios (SNRs).

Fig. 2
figure 2

Examples of data for each scenario, showing differences between samples of each class

Fig. 3
figure 3

Examples of generated data samples for each scenario, showing how a generated sample of Class #0 (where y=0) for each scenario varies across four signal-to-noise ratios (SNRs) \(\alpha\)

With each classification scenario defined, we can form the ground truth feature set of important pixels for a given input based on the positions of tetromino pixels as

$$\begin{aligned} \mathcal {F}^+(\textbf{x}^{(n)}) {:}{=}\left\{ d \mid \left( R^{(n)} \circ (H \circ \varvec{a}^{(n)}) \right) _d \ne 0, \, d \in \left\{ 1, \dots , 4096\right\} \right\} \;. \end{aligned}$$
(3)

For the LIN and MULT scenarios, each sample either contains a ‘T’ or an ‘L’ tetromino at a fixed position, corresponding to the fixed patterns \({\varvec{a}}^{\text {T}}\) and \({\varvec{a}}^{\text {L}}\). Since the absence of a tetromino at one location is just as informative as the presence of the other at another location, we augment the set of important pixels for these two settings as

$$\begin{aligned} \mathcal {F}^+(\textbf{x}^{(n)}) {:}{=}\left\{ d \mid \left( H \circ \varvec{a}^{\text {T}} \right) _d \ne 0 \vee \left( H \circ \varvec{a}^{\text {L}} \right) _d \ne 0, \, d \in \{1, \dots , 4096\} \right\} \;. \end{aligned}$$
(4)

Note that this definition is equivalent to Eq. (3) for the XOR scenario. Moreover, it is equivalent to an operationalization of feature importance put forward by Wilming et al. (2022) for the three static scenarios LIN, MULT, and XOR. Wilming et al. (2022) define any feature as important if it has a statistical dependency on the prediction target across the studied sample. In all cases, an ideal explanation method should attribute importance only to members of the set \(\mathcal {F}^+(\textbf{x}^{(n)})\).
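In code, these ground-truth sets reduce to collecting the non-zero pixel indices of the smoothed patterns. A minimal sketch (function names are ours):

```python
import numpy as np

def f_plus(smoothed_signal):
    """Eq. (3): indices of pixels where the transformed, smoothed pattern is non-zero."""
    return set(np.flatnonzero(smoothed_signal.ravel() != 0))

def f_plus_lin_mult(smoothed_a_T, smoothed_a_L):
    """Eq. (4): union of the supports of both static tetromino patterns (LIN and MULT)."""
    mask = (smoothed_a_T.ravel() != 0) | (smoothed_a_L.ravel() != 0)
    return set(np.flatnonzero(mask))
```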

For training each model and the subsequent analyses, we divide each dataset via a 90/5/5 split into a training set \(\mathcal {D}_{\text {train}}\), a validation set \(\mathcal {D}_{\text {val}}\), and a test set \(\mathcal {D}_{\text {test}}\).

2.2 Classifiers

We use three architectures to model each classification problem. Firstly, a Linear Logistic Regression (LLR) model, which is a single-layer neural network with two output neurons and a softmax activation function. Secondly, a Multi-Layer Perceptron (MLP) with four fully-connected layers, where each of the hidden layers uses Rectified Linear Unit (ReLU) activations. The two-neuron output layer is once again softmax-activated. Finally, we define a Convolutional Neural Network (CNN) with four blocks of ReLU-activated convolutional layers followed by a max-pooling operation, with a softmax-activated two-neuron output layer. The convolutional layers are specified with a progressively increasing number of filters per layer [4, 8, 16, 32], a kernel size of four, a stride of one, and zero-padding. The max-pooling layers are defined with a kernel size of two and a stride of one.
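A PyTorch sketch of the described CNN follows. The exact amount of zero-padding is not specified above and is treated as an assumption, so nn.LazyLinear is used to infer the flattened feature size; the builder name make_cnn is ours.

```python
import torch.nn as nn

def make_cnn(filters=(4, 8, 16, 32)):
    """Sketch of the described CNN; padding amount is an assumption."""
    layers, in_ch = [], 1
    for out_ch in filters:
        layers += [
            nn.Conv2d(in_ch, out_ch, kernel_size=4, stride=1, padding=1),  # zero-padded convolution
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=1),
        ]
        in_ch = out_ch
    layers += [nn.Flatten(), nn.LazyLinear(2), nn.Softmax(dim=1)]          # two-neuron softmax output
    return nn.Sequential(*layers)
```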

We train a given classifier \(f^{\varvec{\theta }}:\mathbb {R}^D \rightarrow \mathcal {Y}\) with parameters \(\varvec{\theta }\) on \(\mathcal {D}_{\text {train}}\). Each network is trained for 500 epochs using the Adam optimizer without regularization, with a learning rate of 0.0005. The validation set \(\mathcal {D}_{\text {val}}\) is used to monitor generalization: the validation loss is computed at each epoch, and the model state with the minimum validation loss is stored, which also prevents the use of an overfit model. Finally, the test set \(\mathcal {D}_{\text {test}}\) is used to compute the resulting model performance and serves as the basis for the evaluation of XAI methods. We consider a classifier to have generalized to the given classification problem when the resulting test accuracy is at or above a threshold of \(80\%\).
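A condensed sketch of this training procedure is given below. The choice of loss function (negative log-likelihood on the softmax outputs) is an assumption, as are the data loaders; the function name train is ours.

```python
import copy
import torch

def train(model, train_loader, val_loader, epochs=500, lr=5e-4):
    """Sketch: Adam without regularization, keeping the state with minimum validation loss."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    nll = torch.nn.NLLLoss()                          # loss choice is an assumption
    best_loss, best_state = float("inf"), None
    for _ in range(epochs):
        model.train()
        for x, y in train_loader:
            opt.zero_grad()
            loss = nll(torch.log(model(x) + 1e-12), y)
            loss.backward()
            opt.step()
        model.eval()
        with torch.no_grad():
            val_loss = sum(nll(torch.log(model(x) + 1e-12), y).item() for x, y in val_loader)
        if val_loss < best_loss:                      # store the best state seen so far
            best_loss, best_state = val_loss, copy.deepcopy(model.state_dict())
    model.load_state_dict(best_state)
    return model
```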

Each network is implemented in PyTorch, and also in Keras with a TensorFlow backend, so as to allow experimentation with a wider variety of XAI methods implemented using either the Captum (Kokhlikyan et al., 2020) or iNNvestigate (Alber et al., 2018) framework. The main text focuses on the former.

2.3 XAI methods and performance baselines

Given a trained machine learning model, we now look to apply post-hoc XAI methods to test data to produce explanations, and we also define several performance baselines as a reference point for comparison during analyses. We compare sixteen popular XAI methods in our analysis. The main text focuses on the results of four: Local Interpretable Model-agnostic Explanations (LIME) (Ribeiro et al., 2016), Layer-wise Relevance Propagation (LRP) (Bach et al., 2015), SHapley Additive exPlanations (SHAP) (Lundberg & Lee, 2017) and Integrated Gradients (Sundararajan et al., 2017).

The full list is detailed in Table 1. This briefly summarizes each method, and provides the details of which library was used for implementation, Captum (Kokhlikyan et al., 2020) or iNNvestigate (Alber et al., 2018), as well as the specific parameterization for each method. Generally, we follow the default parameterization for each method. Where necessary, we specify the baseline \(\textbf{b}\) as the zero input \(\textbf{b} = \varvec{0}\), a common choice in the field (Mamalakis et al., 2022).

The input to an XAI method is a model \(f^{\varvec{\theta }}:\mathbb {R}^D \rightarrow \mathbb {R}\), trained according to parameterization \(\varvec{\theta }\) over \(\mathcal {D}_{\text {train}}\), the n-th test sample to be explained \(\textbf{x}_{\text {test}}^{(n)}\), as well as the baseline reference point \(\textbf{b} = \varvec{0}\) for relevant methods. The method produces an ‘explanation’ \(\textbf{s}(f^{\varvec{\theta }}, \textbf{x}_{\text {test}}^{(n)}, \textbf{b}) \in \mathbb {R}^D\).
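As an illustration of this interface, a minimal Captum sketch for one of the studied methods (Integrated Gradients) could look as follows; a trained PyTorch classifier `model` and a single \((1, 1, 64, 64)\) test sample `x_test` are assumed, and the zero baseline follows the choice \(\textbf{b} = \varvec{0}\) above.

```python
import torch
from captum.attr import IntegratedGradients

# Sketch: computing one explanation s(f, x, b) with Captum's Integrated Gradients.
ig = IntegratedGradients(model)
target = model(x_test).argmax(dim=1)                                  # explain the predicted class
s = ig.attribute(x_test, baselines=torch.zeros_like(x_test), target=target)
importance_map = s.abs().squeeze().detach().numpy()                   # absolute-valued importance map
```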

Table 1 XAI Methods used with a brief description of each method and the implementation details, including the software framework used and any specific parameterization including the baseline input used, if applicable

We include four model-ignorant methods to generate ‘baseline’ importance maps for comparison with the aforementioned XAI methods. Firstly, we consider the Sobel filter, which uses a horizontal and a vertical filter kernel to approximate first-order derivatives of the data. Secondly, we use the Laplace filter, which uses a single symmetric kernel to approximate second-order derivatives of the data. Both are edge detection operators and are applied directly to each test sample. Thirdly, we use a sample from a random uniform distribution \(U((-1,1)^D)\). Finally, we use the rectified test data sample \(\textbf{x}_{\text {test}}^{(n)}\) itself as an importance map.
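A sketch of these baselines using SciPy's standard filters is shown below; the exact kernels and the rectification applied to the sample baseline in the paper may differ, and the function name is ours.

```python
import numpy as np
from scipy import ndimage

def baseline_importance_maps(x, rng):
    """Sketch of the four model-ignorant baselines for one 64x64 test image x."""
    sobel = np.hypot(ndimage.sobel(x, axis=0), ndimage.sobel(x, axis=1))    # gradient magnitude
    laplace = np.abs(ndimage.laplace(x))                                    # second-order edges
    rand = rng.uniform(-1.0, 1.0, size=x.shape)                             # random uniform map
    return {"sobel": sobel, "laplace": laplace, "rand": rand, "sample": x}  # 'sample' is rectified in the paper
```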

2.4 Explanation performance metrics

Based on the well-defined ground truth set of class-dependent features for a given sample \(\mathcal {F}^+(\textbf{x}^{(n)})\), we can readily form quantitative metrics to evaluate the quality of an explanation.

2.4.1 Precision

Omitting the sample-dependence in the notation, we define precision as the fraction of the \(k=|\mathcal {F}^+|\) features of \(\textbf{s}\) with the highest absolute importance scores that are contained within the set \(\mathcal {F}^+\) itself. We confine these results to the appendices and focus on the results and analyses for the two metrics defined next.
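A direct sketch of this metric, where s is an importance map and f_plus the set of ground-truth indices (tie-breaking among equal scores is left unspecified, as above):

```python
import numpy as np

def precision(s, f_plus):
    """Fraction of the k = |F+| highest-|s| features that fall inside the ground truth set F+."""
    k = len(f_plus)
    top_k = np.argsort(np.abs(s).ravel())[-k:]   # indices of the k largest absolute importance scores
    return len(set(top_k) & set(f_plus)) / k
```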

2.4.2 Earth mover’s distance (EMD)

The Earth mover’s distance (EMD), also known as the Wasserstein metric, measures the optimal cost required to transform one distribution to another. We can apply this to the cost required to transform a continuous-valued importance map \(\textbf{s}\) into \(\mathcal {F}^+\), where both are normalized to have the same mass. The Euclidean distance between pixels is used as the ground metric for calculating the EMD, with \(\textrm{OT}(\textbf{s}, \mathcal {F}^+)\) denoting the cost of the optimal transport from explanation \(\textbf{s}\) to ground truth \(\mathcal {F}^+\). This follows the algorithm proposed by Bonneel et al. (2011) and the implementation of the Python Optimal Transport library (Flamary et al., 2021). We define a normalized EMD performance score as

$$\begin{aligned} \textrm{EMD} = 1 - \frac{\textrm{OT}(\textbf{s}, \mathcal {F}^+)}{\delta _{max}}, \end{aligned}$$
(5)

where \(\delta _{max}\) is the maximum Euclidean distance between any two pixels.
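A sketch of this score using the Python Optimal Transport (POT) library follows; it is purely illustrative, since building the full \(4096 \times 4096\) ground-metric matrix is memory-intensive and the mass normalization details may differ from the actual implementation.

```python
import numpy as np
import ot  # Python Optimal Transport (POT)

def emd_score(s, f_plus, shape=(64, 64)):
    """Illustrative sketch of the normalized EMD score of Eq. (5) for one explanation s."""
    coords = np.array([(i, j) for i in range(shape[0]) for j in range(shape[1])], dtype=float)
    M = ot.dist(coords, coords, metric="euclidean")   # ground metric: Euclidean pixel distances
    delta_max = M.max()                               # maximum distance between any two pixels

    a = np.abs(s).ravel()
    a = a / a.sum()                                   # normalized explanation mass
    b = np.zeros(coords.shape[0])
    b[list(f_plus)] = 1.0 / len(f_plus)               # uniform mass on the ground-truth pixels

    cost = ot.emd2(a, b, M)                           # optimal transport cost OT(s, F+)
    return 1.0 - cost / delta_max
```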

Remark

Note that the ground truth \(\mathcal {F}^+(\textbf{x})\) defines the set of important pixels based on the data generation process. It is conceivable, though, that a model uses only a subset of these for its prediction, which must be considered equally correct. The above explanation performance metrics do not fully achieve invariance in that respect. However, both are designed to de-emphasize the impact of false-negative omissions of features in the ground truth on performance, while emphasizing the impact of false-positive attributions of importance to pixels not contained in the ground truth.

2.4.3 Importance mass accuracy (IMA)

Because of this, we consider a third metric, Importance Mass Accuracy (IMA). Calculated as the sum of importance attributed to the ground truth features over the total attribution in the image, this metric is akin to ‘Relevance mass accuracy’ as defined by Arras et al. (2022). We calculate

$$\begin{aligned} \textrm{IMA} = \frac{\sum _{i \in \mathcal {F}^+} s_i}{\sum _{i=1}^{D} s_i}. \end{aligned}$$
(6)

This metric is invariant to false-negative attribution: it does not penalize an explanation that concentrates its importance on only a subset of the pixels in \(\mathcal {F}^+(\textbf{x})\), and it uses the whole attribution map rather than only the top-\(k\) features, as Precision does. Moreover, it is a direct measure of false-positive attribution, where a score of 1 signals a perfect explanation highlighting only ground truth features as important. We use this metric to complement the strengths of \(\textrm{EMD}\) whilst also presenting an alternative perspective on quantifying explanation performance.
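In code, the metric is a one-line ratio; the use of absolute-valued attributions here mirrors the absolute-valued importance maps used in the experiments and is an assumption for signed explanations.

```python
import numpy as np

def ima(s, f_plus):
    """Importance mass accuracy (Eq. 6): share of total attribution that falls inside F+."""
    s = np.abs(s).ravel()          # absolute-valued attributions
    return s[list(f_plus)].sum() / s.sum()
```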

3 Experiments

Our experiments aim to answer four main questions:

  1.

    Which XAI methods are best at identifying truly important features as defined by the sets \(\mathcal {F}^+(\textbf{x})\)? We do not expect that any method would achieve perfect performance for our metrics, as this may be unrealistic due to the aforementioned ways that our metrics interact with \(\mathcal {F}^+(\textbf{x})\). We hypothesize, however, that performance trends are consistent between scenarios. With this in mind, we do not aim to explicitly rank methods for the purpose of handing out recommendations for XAI methods to use in practice. Our focus is rather on comparing XAI method performance to baseline methods to identify performance weaknesses that can guide the future development of improved methods. That said, the past study of Wilming et al. (2022) showed that the PatternNet and PatternAttribution (Kindermans et al., 2018) methods perform best in a linear problem setting, so we would expect to see the same here for LIN. How this performance carries over to non-linear settings is yet to be seen, motivating the following experiments.

  2.

    Does explanation performance for each method remain consistent when moving from explaining a linear classification problem to problems with different degrees of non-linearity? No prior studies exist on this comparison between linear and non-linear problem settings; however, we anticipate that it is difficult to directly compare different scenarios. One difficulty is that each scenario requires a different trained model, and while our implementations are designed to be as comparable as possible, it has been shown that explanation performance is affected by classification performance (Arras et al., 2022; Oliveira et al., 2024). Another aspect complicating comparisons across scenarios is due to properties of the scenarios themselves: some XAI methods may perform better in the scenarios with a fixed ground truth position than in the RIGID scenario.

  3.

    Does adding correlations to the background noise, through smoothing with the Gaussian convolution filter, negatively impact explanation performance? Suppressor variables have been shown to negatively impact explanation performance (Haufe et al., 2014; Wilming et al., 2022; Oliveira et al., 2024). Here, the correlation between background pixels overlapping with tetromino features and neighboring background pixels invokes the presence of suppressor variables (those neighboring pixels). Knowledge of these background pixels may be useful to the machine learning models, for instance for denoising the correlated background to make the underlying classification easier. We therefore expect that performance for CORR scenarios will be worse than for their WHITE equivalents, although this performance difference will also likely depend on the strength of the correlations induced by the smoothing operation.

  4.

    How does the choice of model architecture impact explanation performance? XAI methods may perform differently for different architectures. For example, GradCAM (Selvaraju et al., 2017) is only applicable to Convolutional Neural Network (CNN) architectures, but others applicable to some or all models studied here may prefer properties of one model architecture over another. CNNs may perform better than MLPs for the RIGID scenario, as invariance to translation and rotation operations is one of the main desirable properties of CNNs. We expect that such differences between model architectures will also be seen when considering the downstream task of model explanation.

We generate a dataset for each scenario across a range of 20 choices of \(\alpha\), finding the ‘sweet spot’ where the average test accuracy over 10 trained models is at or above 80%. Table 2 shows the resulting \(\alpha\) values as well as the average test accuracy over five model trainings for each scenario, using datasets of size \(N=40,000\). A wide range of SNRs is required across the problems, and it is difficult to tune each scenario and background type exactly to the \(80\%\) performance threshold. CORR scenarios, perhaps aided by the suppressing correlated background pixels, achieve the best performance on average while requiring the lowest SNRs when compared to WHITE and IMAGENET variants. The MULT WHITE scenario is particularly difficult to model, requiring a much higher SNR than the CORR and IMAGENET variants. Using the 90/5/5 data split described in Sect. 2.1, we compute absolute-valued importance maps \(|\textbf{s}|\) for the intersection of test data \(\mathcal {D}_{\text {test}}\) correctly predicted by every appropriate classifier. The full set of training results for finding appropriate SNRs can be seen in Fig. 4 in Appendix B.5. Experiments are run on an internal CPU and GPU cluster, with a total runtime on the order of hours.

Table 2 Results of the model training process for each classification setting, model architecture, and background type
Fig. 4
figure 4

Average test accuracy over 10 model trainings for each problem scenario and model architecture, for a fixed range of signal-to-noise ratios (SNRs). As expected, the Linear Logistic Regression (LLR) model cannot perform above chance level for non-linear scenarios. The Convolutional Neural Network (CNN) outperforms the Multi-Layer Perceptron (MLP) for the RIGID (translations and rotations of tetrominoes) scenarios as expected, perhaps due to the invariance under these properties for this architecture

4 Results

With data generated, models trained, and experiments defined, we move to analyzing the explanations produced by the given set of post-hoc XAI methods. We first start with a qualitative analysis, looking at example explanations produced for given samples in Fig. 5, Appendix B.7.1, and Fig. 8. Such analysis is commonly used in XAI methods papers (for example, Bach et al. (2015); Ribeiro et al. (2016); Lundberg and Lee (2017)), with authors assessing the visual quality of explanations for a chosen example, and little to no quantitative analysis being done to verify explanation performance empirically. As such, we focus on quantitative analysis afterwards, showing boxplots of explanation performance for the EMD and IMA metrics in Fig. 6 and in Appendix B.7.2.

Fig. 5
figure 5

Absolute-valued importance maps obtained for a random correctly-predicted data sample, for selected XAI methods and baselines. Recovery of the ground truth pattern across all scenarios is best shown by XAI methods applied to a Linear Logistic Regression (LLR) model. The Multi-Layer Perceptron (MLP) tends to focus on noise in the case of ImageNet backgrounds, and LIME often fails to produce sensible explanations across all model architectures (Color figure online)

Fig. 6
figure 6

Quantitative explanation performance of individual sample-based feature importance maps produced by various XAI approaches and baseline methods on correctly-predicted test samples, as per the \(\textrm{EMD}\) (top) and \(\textrm{IMA}\) (bottom) metrics. Depicted are boxplots of explanation performance, showing the median, the upper and lower quartiles, and outliers. The white areas (left) show results for white background noise (WHITE), whereas the light gray shaded areas (middle) show results for the correlated background noise (CORR) scenarios and the darker gray areas (right) for ImageNet (IMAGENET) backgrounds (Color figure online)

4.1 Qualitative analysis

Figure 5 depicts examples of absolute-valued importance maps produced for a random correctly-predicted sample for each scenario and model. Shown are results for four XAI methods (Gradient SHAP, LIME, LRP, and PatternNet respectively) for each of the three models (LLR, MLP, CNN respectively) followed by the model-ignorant Laplace filter. Qualitative recovery of the signal tetromino patterns is mixed across all scenarios, models, and XAI methods, with no single method standing out as the best. LIME, however, fails to produce sensible explanations in all cases. While no method concentrates its importance predominantly within the ground truth pattern, the tetromino patterns can be recognized in many cases, even at low signal-to-noise ratios. The MLP tends to produce noisier explanations than the CNN, especially for the complex structures in the background of the ImageNet examples. We can often see noisy false-positive attribution to statistically irrelevant features related to the image background. In many of the explanations for the scenarios with the CORR background using the LLR and MLP, we can see ‘halos’ of importance attributed to features outside, but nearby, the ground truth. This points to the potential use of suppressor variables by the models, in this case pixels outside the ground truth that are correlated with pixels inside the ground truth due to overlapping structures in the image background. Appendix B.7.1 expands on the qualitative results of the main text, and Fig. 8 shows the absolute-valued global importance heatmaps for the LIN, MULT, and XOR scenarios, given as the mean of all explanations for every correctly-predicted sample of the given scenario and XAI method. As the RIGID scenario has no static ground truth pattern, calculating a global importance map is not possible.

4.2 Quantitative analysis

Figure 6 shows explanation performance of individual sample-based importance maps produced by the selected XAI and baseline methods, across five models trained for each scenario-architecture parameterization, in terms of the \(\textrm{EMD}\) and \(\textrm{IMA}\) metrics. Appendix B.7.2 expands on the quantitative results of the main text, detailing results for all 16 methods studied and for our Precision metric.

We can now answer the experimental questions laid out in Sect. 3:

1. Which XAI methods are best at identifying truly important features as defined by the sets \(\mathcal {F}^+(\textbf{x})\)?

Within most scenario-architecture parameterizations, the performances of the studied XAI methods are relatively homogeneous. Baseline methods also tend to perform similarly to one another. Interestingly, their performance is on par with or even superior to that of various XAI methods in certain scenarios. Most notably, a simple Laplace edge detection filter outperforms nearly all other methods in the RIGID as well as the XOR scenarios, when used in combination with correlated backgrounds (CORR). \(\textrm{IMA}\) results for baseline methods in the RIGID scenario show much less variance in the boxplots of Fig. 6b than their \(\textrm{EMD}\) equivalents in Fig. 6a.

The results show substantial variability in performance for all methods across different problems and model architectures, so we cannot declare one specific ‘best’ method. For the linear case, we can recommend PatternNet and PatternAttribution as being able to recover the signal optimally in the presence of suppressor variables; however, this property does not translate well to non-linear cases. Looking at Fig. 6, no method performs consistently near the (perhaps unattainable) perfect EMD or IMA score. In many cases, particularly for IMA, the scores are low (\(\le 0.5\)) across the board, signaling that half or less of the total image attribution goes to truly important features. Four exceptions are the CNN results in the RIGID case, where LRP for WHITE and IMAGENET, GradSHAP for WHITE, and Integrated Gradients for WHITE achieved an average above 0.8.

LIME fails in all cases at this dimensionality, so it may be unsuitable for a user looking to implement explainability on higher-dimensional image data. Plotting results without LIME would improve the visual spread of the remaining results; however, as such a popular method in the field, it is important to study it. From Figs. 9, 10, and 11, we can see that gradient-based methods tend to perform similarly to one another, as do the SHAP variants. While this may not be a surprise, each of these variants and formulations is claimed by its creators to possess benefits over other XAI methods, yet the reality shows a different story. For example, one might hope that DeepSHAP shows stronger performance than GradSHAP for deeper architectures such as the CNN; however, this is not necessarily the case. It is also possible that the CNN architecture studied here is not deep enough to benefit a method like DeepSHAP.

We also observe that comparisons in terms of the EMD metric are made difficult by the metric’s comparatively narrow range of values; the rand baseline (sampling from a uniform distribution) averages around 0.7 for RIGID and 0.875 otherwise, which is evidence of this.

In answer to experimental question 1, for the purposes of this study (assessing the false-positive attribution of feature importance to non-important variables such as suppressors), it is not clear which individual XAI method can be considered the ‘best’; however, we have shown that the random performance baselines can achieve competitive or even better explanation performance scores than many XAI methods. It is important for strong XAI methods to outperform such baselines to justify their use in practice, and future methods should be developed with such baselines in mind.

2. Does explanation performance for each method remain consistent when moving from explaining a linear classification problem to problems with different degrees of non-linearity?

Here we can see again that some methods vary in performance depending on the type of non-linearity (most perform better for MULT, with its fixed pattern positions, than for RIGID), with a larger spread of \(\textrm{EMD}\) and \(\textrm{IMA}\) scores (seen in the size of boxes and whiskers of Fig. 6) for non-linear scenarios than for LIN.

PatternNet and PatternAttribution (Kindermans et al., 2018), whose results are shown in the appendix (Figs. 9, 10, 11, 17, and 18), were proposed in part to solve the suppressor problem, and we can see that this does not always hold. These methods show strong performance for LIN as proposed, and as was seen in Wilming et al. (2022), but do not generalize as well to most non-linear scenarios. Notably, when the signal pattern is not in a fixed position (i.e., RIGID), these methods perform worse than when the signal is in a fixed position (i.e., MULT and XOR). More specifically, they also appear to learn the complete signal pattern (i.e., the tetromino shapes for both classes), so in the XOR case, where both shapes are present and fixed in each sample, they outright perform the best, as one might expect.

The results for the RIGID scenarios may be taken with a pinch of salt, as the high signal-to-noise ratios (SNRs) lead to highly salient tetrominoes in sample images. Notably, explanations produced for CNNs in this case tend to perform very well for both the \(\textrm{EMD}\) and \(\textrm{IMA}\) metrics compared to most results for any other model architecture and problem scenario. While this problem itself (identifying a pattern with rotation and scaling invariance) is the most realistic of the four presented here, particularly when applied to CNNs, the high saliency of tetrominoes is perhaps not wholly akin to realistic problem settings, where the relative saliency of individual objects of interest is usually far lower. The high saliency of the tetrominoes derives from our experimental choice to adjust SNRs to achieve a predefined minimal classification performance threshold, which required high SNR in this setting. An alternative approach could be to reverse this and fix the SNR for all scenarios and background types.

3. Does adding correlations to the background noise, through smoothing with the Gaussian convolution filter, negatively impact explanation performance?

When comparing results between WHITE and CORR, we observe a decrease in performance and an increase in spread in most cases. This can be attributed to the imposed correlations (induced through Gaussian smoothing) between background pixels and the pixels overlapping with \(\mathcal {F}^+\), which cause background pixels to act as suppressor variables. One can control the strength of this effect by increasing or decreasing the Gaussian smoothing parameter \(\sigma _{\text {smooth}}\). The effect can be observed most strongly when comparing RIGID WHITE to RIGID CORR for the \(\textrm{IMA}\) metric, suggesting that correlations in the background do indeed increase false-positive attribution in model explanations.

4. How does the choice of model architecture impact explanation performance?

For LIN, explanation performance of all methods is similar across architectures in most cases. When moving to non-linear scenarios, we see little consistency in how architectures perform: the CNN performs best in the RIGID case, whereas the MLP performs relatively better for the fixed tetromino position cases of MULT and XOR.

In a few cases, performance tends to decrease as model complexity increases (from the simple LLR to the complex CNN architecture). One notable exception is the RIGID scenario, where the CNN outperforms other models. This can perhaps be explained by the CNN architecture lending itself well to rotation/translation invariance, whereas the properties of the MLP work better for a fixed-position ground-truth class-conditional distribution. However, in the RIGID setting nearly all XAI methods are outperformed by a simple Laplace edge detection filter for correlated backgrounds. In this case, the discrepancy between MLP and CNN performance is amplified for the \(\textrm{IMA}\) metric, with the CNN performing relatively better for a few XAI methods. The CNN also performs well in the case of the more complicated IMAGENET backgrounds.

We can also note that when multiple models achieve similar classification performance on a task, a user may assume similar explanation quality, or simply not realize that explanation performance can differ vastly, as seen in the MLP vs. CNN results for RIGID in Fig. 6, and qualitatively in Fig. 5 across all architectures.

5 Discussion

Experimental results confirm our main hypothesis that explanation performance is lower in cases where the class-specific signal is combined with a highly auto-correlated class-agnostic background (CORR) compared to a white noise background (WHITE). The difficulty XAI methods have in correctly highlighting the truly important features in this setting can be attributed to the emergence of suppressor variables. Importantly, the misleading attribution of importance by an XAI method to suppressors can lead to misinterpretations regarding the functioning of the predictive model, which could have severe consequences in practice. Such consequences could be unjustified mistrust in the model’s decisions, unjustified conclusions regarding the features related to a certain outcome (e.g., in the context of medical diagnosis), and a reinforcement of such false beliefs in human-computer interaction loops. It is therefore important that future XAI methods be developed to either highlight only truly important features, or to inform the user whether a variable with high attributed importance is truly important, a suppressor, or otherwise.

We have also seen that when multiple ML architectures can be used interchangeably to appropriately solve a classification problem (here with classification accuracy required to be above 80%), they may still produce disparate explanations. Architectures not only differed with respect to the selection of pixels within the correct set of important features, but also showed different patterns of false-positive attributions of importance to unimportant background features. If one cannot produce consistent and sensible results for multiple seemingly appropriate ML architectures, the risk of model mistrust may be especially pronounced.

A recent survey showed that one in three XAI papers evaluate methods exclusively with anecdotal evidence, and one in five with user studies (Nauta et al., 2023). Other work in the field tends to focus on secondary criteria (such as stability and robustness (Rosenfeld et al., 2021; Hedström et al., 2022)) or subjective or potentially circular criteria (such as fidelity and faithfulness (Gevaert et al., 2022; Nauta et al., 2023)). It was shown in Wilming et al. (2023) that faithfulness as a concept, when treated as an XAI method in itself, promotes the attribution of importance to suppressor variables. We therefore doubt that such secondary validation approaches can fully replace metrics assessing objective notions of ‘correctness’ of explanations, considering that XAI methods are widely intended to be used as means of quality assurance for machine learning systems in critical applications. Thus, the development of specific formal problems to be addressed by XAI methods, and the theoretical and empirical validation of respective methods to address specific problems, is necessary. In practice, a stakeholder may often (explicitly or implicitly) expect that a given XAI method identifies features that are truly related to the prediction target. If suppressors are present in the data and are highlighted as important by an XAI method, the user may seek to use these variables as a target for intervention (e.g., as a genetic manipulation or drug target in the context of a genome-wide association experiment). However, any attempt to manipulate suppressor features to influence the prediction target would be futile. In the worst case, time and money would be wasted, and in any case, the false-positive attribution of importance to suppressor features has provided no value to the user. In contrast to other notions of faithfulness, the expectation that an XAI method identifies features truly related to the target is an objectively quantifiable property of an XAI method, and we here propose various linear and non-linear types of ground-truth data along with appropriate metrics to directly measure explanation performance according to this definition. While our work is not the first to provide quantitative XAI benchmarks (see Tjoa & Guan, 2020; Li et al., 2021; Zhou et al., 2022; Arras et al., 2022; Gevaert et al., 2022; Agarwal et al., 2022), our work differs from most published papers in that it allows users to quantitatively assess potential misinterpretations caused by the presence of suppressor variables in data.

One potential limitation of the \(\textrm{EMD}\) metric is the strictness of limiting the ground truth feature set \(\mathcal {F}^+\) to the specific pixels of tetrominoes \({\varvec{a}}^{\text {T/L}}\) compared to, say, the set of features outlining \({\varvec{a}}^{\text {T/L}}\). Alternative definitions of \(\mathcal {F}^+\) could be conceived to more flexibly adapt to different potential ‘explanation strategies’. Figure 7 in the appendices outlines four ‘explanation strategies’ and how the \(\textrm{EMD}\) metric varies with each. Notably, an ‘outline’ explanation performs worse than an explanation highlighting a subset of \(\mathcal {F}^+\). This highlights two interesting features of our novel metric. Firstly, a strongly performing ‘subset’ explanation shows that \(\textrm{EMD}\) does not penalize false negatives (not attributing high importance to some truly important features) as harshly as Precision and other ‘top-k’ metrics do. Secondly, the ‘outline’ explanation functions in a presumably similar way to some model-ignorant edge detection methods, and performs the worst of any explanation strategy shown in Fig. 7. Yet, we have shown such edge detection methods to be capable of outperforming many XAI methods in some problem scenarios. Our \(\textrm{IMA}\) metric also complements this potential limitation of \(\textrm{EMD}\), where it does not matter if the attribution of importance to features of \(\mathcal {F}^+\) is spread across all features, or just more intensely attributed to a subset. This metric directly measures false-positive attribution of importance to features outside of \(\mathcal {F}^+\), and assists the user in understanding the role that suppressors play in model explanations. We have also seen that the EMD metric produces scores over quite a small range, where a ‘low’ EMD score is hard to achieve, even for a truly random explanation (rand). This contributes to the close distribution of results of Figs. 6a and 9, making experimental question 1 tougher to answer. Future work will look into the development of improved metrics to quantitatively evaluate XAI methods more robustly. For example, we can use the null distribution to normalize/standardize the EMD metric, either by subtracting the null mean or by subtracting the mean and dividing by the null standard deviation, thus widening the range of realistically attainable scores. This benchmark focuses on the issue of ‘correctness’ of explanations, so we will also unify the XAI-TRIS benchmarks and (improved) metrics with ‘secondary’ quality metrics such as robustness and, potentially, faithfulness/fidelity. Doing so will widen the characteristics studied by each given metric, and will provide a more comprehensive overview of the performance of XAI methods beyond ‘correctness’.

While we compare a total of 16 XAI methods, the space of possible neural network architectures is too vast to be represented; therefore we only compared one MLP and one CNN architecture here. However, our experiments hopefully serve as a showcase for our benchmarking framework, which can be easily extended to other architectures. Finally, our framework serves much needed validation purposes for methods that are conceived to themselves play a role in the quality assurance of AI. As such, we expect that the benefits of our work far outweigh potential negative implications on society, if any. A possible risk, even if far-fetched, would be that one may reject a fit-for-purpose XAI method based on empirical benchmarks such as ours, which do not necessarily reflect the real-world setting and may hence be too strict.

Future work will also focus on integration of the XAI-TRIS benchmarks with other related benchmarks (Wilming et al., 2022; Oliveira et al., 2024) into one platform, aiming to test the performance of XAI methods across a suite of domains and problems. We also plan to extend this with the creation of more realistic benchmarks in the domains of medical imaging and natural language processing. With the availability of such a unified benchmark suite, the possibility of developing fit-for-purpose and goal-driven XAI methods is open to researchers.

6 Conclusion

We have used a data-driven generative definition of feature importance to create XAI-TRIS, synthetic datasets with well-defined ground truth explanations, and have used these to provide an objective assessment of XAI methods when applied to various classification problems. Furthermore, we have defined new quantitative metrics of explanation performance and demonstrated that many popular XAI methods do not behave in an ideal way when moving from linear to non-linear scenarios. Our results have shown that XAI methods can even be outperformed by simple model-ignorant edge detection filters in the RIGID use case, in which the object of interest is not located in a static position. Finally, we have shown that XAI methods may provide inconsistent explanations when using different model architectures under equivalent conditions. Future work will be to develop dedicated performance benchmarks in more complex and application-specific problem settings such as medical imaging.