1 Introduction

Generative AI (GenAI) has shown remarkable capabilities that have shaken up the world on a broad basis, ranging from regulators (European Union 2023) and educators (Baidoo-Anu and Ansah 2023) to programmers (Sobania et al. 2023) and medical staff (Thirunavukarasu et al. 2023). For businesses (Porter 2023), GenAI has the potential to unlock trillions of dollars annually (McKinsey & Company 2023). At the same time, it is said to threaten mankind (The Guardian 2023). These opposing views are a key driver for understanding and explaining GenAI. Generative AI, driven by foundation models, represents the next level of AI, capable of creating text, images, audio, 3D solutions, and videos (Schneider et al. 2024c; Gozalo-Brizuela and Garrido-Merchan 2023; Cao et al. 2023), all controllable by humans through textual prompts (White et al. 2023) (see Table 2 for examples of public GenAI systems). This advancement marks a significant shift from AI that primarily "recognizes" to AI that "generates." GenAI has shown unprecedented capabilities, such as passing university-level exams (Choi et al. 2021; Katz et al. 2024). It also achieves remarkable results in areas once considered unsuitable for machines, such as creativity (Chen et al. 2023a). It is accessible to everyone, as witnessed by commercial systems like ChatGPT (Achiam et al. 2023) and Dall-E (Betker et al. 2023; Ramesh et al. 2022). Early generative AI methods, like Generative Adversarial Networks (GANs), could also generate artifacts but were typically more difficult to control than modern models such as transformers and diffusion architectures.

Explainable AI for GenAI (GenXAI) techniques provide explanations that help understand AI outputs for individual inputs or the model as a whole. Traditionally, explanations have served various purposes, such as increasing trust and aiding in model debugging (Meske et al. 2022). The need for understanding AI is greater now than in pre-GenAI eras. For example, explanations can support the verifiability of generated content, helping combat one of GenAI’s major problems: hallucinations (as discussed in Sect. 3.1). Unfortunately, Explainable AI (even for pre-GenAI models) still faces several open problems despite numerous attempts to address them over the past few years (Longo et al. 2024; Meske et al. 2022). For example, a recent comparison (Silva et al. 2023) of methods on the impact of XAI on human-agent interaction found only a 20% difference in scores between the best (counterfactuals) and worst method (using probability scores), suggesting that complex methods offer limited benefits over simpler ones. Therefore, XAI techniques are still far from optimal. Other works have even described the “status quo in interpretability research as largely unproductive” (Räuker et al. 2023). Therefore, much work remains, and it is essential to understand current efforts to learn from and improve upon them, especially to mitigate high risks (The Guardian 2023) while leveraging opportunities (Schneider et al. 2024c).

Table 1 Nomenclature

This research manuscript aims to make genuine progress in this direction. Our goal is not merely to list and structure existing XAI techniques; at this stage, more fundamental questions need addressing, such as identifying key challenges and desiderata for GenXAI. To this end, we opted for a narrative review methodology (King and He 2005) combined with a taxonomy development approach from the field of information systems (Nickerson et al. 2013). Several surveys on XAI focus on the pre-GenAI era with a primary technical focus (Adadi and Berrada 2018; Zini and Awad 2022; Dwivedi et al. 2023; Schwalbe and Finzel 2023; Räuker et al. 2023; Saeed and Omlin 2023; Speith 2022; Minh et al. 2022; Bodria et al. 2023; Theissler et al. 2022; Guidotti et al. 2019; Guidotti 2022) and an interdisciplinary or social science focus (Miller 2019; Meske et al. 2022; Longo et al. 2024). Building upon these, we conduct a meta-survey to structure our methods, leveraging knowledge from the pre-GenAI era. Additionally, we uncover novel aspects related to GenAI that have not yet been covered. Many works have surveyed various aspects of GenAI (excluding XAI) (Xu et al. 2023; Lin et al. 2022; Xing et al. 2023; Yang et al. 2023b; Zhang et al. 2023a, c; Pan et al. 2023). We use such surveys for our technical background. Some sub-areas of GenAI, such as knowledge identification and editing (Zhang et al. 2024), use isolated XAI techniques as tools but do not aim to elaborate on them generally. While we could not find any review discussing XAI for GenAI, some research manuscripts take a holistic, partially opinionated view on XAI for large language models (LLMs) (Singh et al. 2024; Liao and Vaughan 2023) or explicitly survey XAI for LLMs (Zhao et al. 2023a; Luo and Specia 2024). None of the prior works provide a comprehensive list of desiderata, motivations, challenges for XAI for GenAI, and a taxonomy. Many of our novel aspects, in particular, cannot be found in prior works. Furthermore, even when focusing solely on LLMs, we differ considerably from prior works.

Fig. 1 Article outline. From important factors and challenges of XAI for GenAI, desiderata and, in turn, a taxonomy are derived. All three together inform our research agenda

We start by presenting a technical background. To derive our contributions, we follow the outline in Fig. 1. We then provide motivation and challenges for XAI for GenAI, highlighting novel aspects that emerge with GenAI, such as its increased societal reach and the need for users to interactively adjust complex, difficult-to-evaluate outputs. From this, we derive desiderata, i.e., requirements that explanations should ideally meet, such as supporting interactivity and output verification. Next, we develop a taxonomy for existing and future XAI techniques for GenAI. To categorize XAI, we use dimensions related to the inputs, outputs, and internal properties of GenXAI techniques that distinguish them from pre-GenAI, such as self-explanation and different sources and drivers for XAI, like prompts and training data. Using the identified challenges and desiderata, the remainder of this manuscript focuses on discussing novel dimensions for GenXAI and the resulting taxonomy, as well as XAI methods in conjunction with GenAI. Finally, we provide future directions. Our key contributions include describing the need for XAI for GenAI, outlining desiderata for explanations, and developing a taxonomy for mechanisms and algorithms that includes novel dimensions for categorization (Table 2).

Table 2 Examples of in-/outputs for GenAI. For more examples, see (Gozalo-Brizuela and Garrido-Merchan 2023)

2 Technical background

Here, we provide a short technical introduction to generative AI, covering key ideas on system and model architectures and training procedures. We restrict ourselves to text and image data to illustrate multi-modality. For video and audio, please refer to other surveys (e.g., Selva et al. 2023; Zhang et al. 2023b). Nomenclature is provided in Table 1.

2.1 System architectures

GenAI models can function as stand-alone applications with a simple user interface, allowing textual inputs or uploads and displaying responses, as seen with OpenAI’s ChatGPT (Achiam et al. 2023). Thus, a system might essentially consist of one large model; such models are almost exclusively based on deep learning, taking an input, processing it with a neural network, and yielding an output. For multi-modal applications, systems that consist of an LLM and other generative models, such as diffusion models, are typically employed. However, GenAI-powered systems may involve external data sources and applications interacting in complex patterns, as illustrated in Fig. 2. An orchestration application may determine actions based on GenAI outputs or user inputs. For example, in ChatGPT-4, a user can include a phrase like “search the internet” in the prompt, implying that an Internet search is conducted first, and the retrieved web content is then fed into the GenAI model. The orchestration application is responsible for performing the web search and modifying the prompt to the GenAI model, e.g., enhancing it with an instruction like “Answer based on the following content:” followed by the retrieved web information.
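To make the orchestration pattern concrete, the following minimal sketch shows a hypothetical orchestration loop; the function names (web_search, call_genai_model) and the trigger phrase are illustrative placeholders rather than the API of any actual product.

```python
def web_search(query: str) -> str:
    """Placeholder for an external search application (illustrative stub)."""
    return "...retrieved web content..."

def call_genai_model(prompt: str) -> str:
    """Placeholder for the call to the GenAI model (illustrative stub)."""
    return f"[model response to: {prompt[:60]}...]"

def orchestrate(user_prompt: str) -> str:
    """Decide on actions based on the user input: optionally search the web first,
    then augment the prompt with the retrieved content before calling the model."""
    if "search the internet" in user_prompt.lower():
        retrieved = web_search(user_prompt)
        user_prompt = ("Answer based on the following content:\n"
                       f"{retrieved}\n\n{user_prompt}")
    return call_genai_model(user_prompt)

print(orchestrate("Search the internet: what is GenXAI?"))
```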

Fig. 2 Overview of GenAI system architectures comprising a single or combined model (with a GUI) as well as GenAI systems interacting with other applications

2.2 GenAI model architectures

We discuss key aspects of the transformer architecture and diffusion models and briefly elaborate on other generative models.

2.2.1 Transformers

Transformers are the de-facto standard for LLMs, while GenAI models involving images may also use diffusion models, variational autoencoders (VAEs), and generative adversarial networks (GANs). The transformer model makes few assumptions (priors) about the input data, making it highly flexible. Since such priors reduce the amount of data needed to train a model, transformers often require more data than other models to achieve the same performance, although simpler models with stronger priors might never reach the same top-level performance. The transformer architecture (Vaswani et al. 2017) (Fig. 3) has many variations (Lin et al. 2022), mostly involving different implementations of individual elements, such as different types of positional embeddings (Dufter et al. 2022), different types of attention (de Santana Correia and Colombini 2022), or even replacing some components. For instance, Hyena (Poli et al. 2023) provides a drop-in replacement for attention based on convolutions. The goals of these adjustments are typically better performance and faster computation. For example, the original transformer requires runtime quadratic in the input length, making it prohibitive for very long inputs. The vanilla transformer architecture (Fig. 3) consists of an encoder and a decoder, where the decoder processes the outputs of the encoder and the (shifted) targets. Consider a translation scenario, where the encoder takes a sentence in the source language, and the decoder generates one output word at a time in the desired language. Each generated word also becomes an input to the decoder for generating the next word. Decoder-only architectures, such as the GPT series (Radford et al. 2019; Achiam et al. 2023), lack an encoder. In contrast, encoder-only architectures lack a decoder and typically produce contextualized embeddings of single words or text fragments (e.g., BERT (Devlin et al. 2019)).

Fig. 3 Transformer architecture (Vaswani et al. 2017)

Both the encoder and the decoder take tokens as inputs, which are mapped to embeddings, i.e., vectors in a latent space. Raw tokens can only be compared for equality, whereas vectors in a latent space allow a more nuanced similarity computation and, potentially, the extraction of specific token attributes such as sentiment. Thus, many current GenAI systems use encoders to obtain vector embeddings and retrieve relevant information to enhance prompts by searching in a vector database (Lewis et al. 2020). A positional encoding is added to the text embedding so the network knows the position of each word. After that, inputs are processed through multi-head attention. A single attention head focuses on a specific aspect of the input (e.g., syntax or sentiment), and the result is then further processed through a Feed Forward Network (FFN). Attention can be masked for the decoder so it cannot access the actual or future targets it wants to predict. For example, in the common task of next-word prediction, a triangular matrix mask prevents the network from accessing the next word and the words following it.
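As a minimal illustration of the masking described above, the following sketch computes single-head scaled dot-product attention with a causal (triangular) mask in NumPy; shapes and variable names are illustrative, and the multi-head projection matrices are omitted for brevity.

```python
import numpy as np

def causal_attention(Q, K, V):
    """Single-head scaled dot-product attention with a causal mask.

    Q, K, V have shape (seq_len, d); position t may only attend to
    positions <= t, as required for next-word prediction.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                  # (seq_len, seq_len)
    mask = np.triu(np.ones_like(scores), k=1)      # 1s mark future positions
    scores = np.where(mask == 1, -np.inf, scores)  # block attention to the future
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True) # row-wise softmax
    return weights @ V

x = np.random.randn(5, 8)               # 5 tokens with embedding size 8
print(causal_attention(x, x, x).shape)  # (5, 8)
```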

2.2.2 Diffusion models

Diffusion models learn to reconstruct noisy data. They first distort inputs by repeatedly adding small amounts of noise until the image is indistinguishable from noise following, e.g., a Gaussian distribution that one can sample from. For sample generation, they reverse this process, taking “noise” as input and reconstructing samples. There are several mathematically intricate methods for diffusion models (Yang et al. 2023b). Here, we discuss the key steps of a prominent technique highly relevant for text-to-image generation: the Denoising Diffusion Probabilistic Model (DDPM) (Ho et al. 2020). In the forward pass (input distortion), DDPM acts as a Markov chain, meaning only the current state (or input) is relevant for the next output. For a given data distribution \(x_0 \sim q(x_0)\), DDPM produces output \(x_T\) in a sequence of T sequential steps by computing at step t: \(q(x_t|x_{t-1})=N(x_t;\sqrt{1-\beta _t}x_{t-1},\beta _t I)\). Thus, the overall forward process is \(q(x_{1:T}|x_0)=\prod _{t=1}^T q(x_t|x_{t-1})\). Here, N represents the Gaussian distribution, and \(\beta _t\) is a hyperparameter. The reverse pass for generation starts from \(p_{\theta }(x_T)\) and produces \(p_{\theta }(x_0)\), which should follow the true data distribution \(q(x_0)\). Since generation starts from random noise, the outputs of diffusion models are not easily controllable; additional input must be provided during the reconstruction process to guide the generation towards user-desired images, as discussed in Sect. 2.4.
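For illustration, the following sketch implements the forward noising step \(q(x_t|x_{t-1})\) defined above on a toy one-dimensional "image" in NumPy; the linear noise schedule and the toy data are illustrative assumptions.

```python
import numpy as np

def forward_step(x_prev, beta_t, rng):
    """One DDPM forward step: sample x_t ~ N(sqrt(1 - beta_t) * x_{t-1}, beta_t * I)."""
    noise = rng.standard_normal(x_prev.shape)
    return np.sqrt(1.0 - beta_t) * x_prev + np.sqrt(beta_t) * noise

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)   # illustrative noise schedule (hyperparameters beta_t)
x = rng.standard_normal(16)          # toy "image" with 16 pixels

for beta in betas:                   # after T steps, x is close to pure Gaussian noise
    x = forward_step(x, beta, rng)
```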

2.3 Other generative models: VAEs, GANs

We also discuss generative models aside from diffusion models but refer to (Yang et al. 2023b) for a detailed comparison and in-depth elaboration. Generative Adversarial Networks (GANs) (Goodfellow et al. 2015) are trained using a generator that constructs an output from a random vector and a discriminator that aims to distinguish generated outputs \({\hat{x}}\) from actual samples of the true data distribution \(x \sim q(x)\). An autoencoder (Li et al. 2023b) comprises an encoder and a decoder. The encoder compresses a given input into a latent space, and the decoder attempts to reconstruct the input from this compressed representation. A Variational Autoencoder (VAE) constrains the latent space to a given prior distribution through a regularization term as part of the optimization objective. If the latent space follows a known distribution, sampling (i.e., sample generation) is facilitated.
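To make the VAE regularization concrete, the standard training objective maximizes the evidence lower bound \(\mathbb {E}_{q_\phi (z|x)}\left[ \log p_\theta (x|z)\right] - D_{KL}\left( q_\phi (z|x)\,\Vert \,p(z)\right) \), where the first term rewards faithful reconstruction by the decoder \(p_\theta \) and the second (regularization) term pulls the encoder distribution \(q_\phi (z|x)\) towards the prior \(p(z)\), e.g., a standard Gaussian, so that sampling \(z \sim p(z)\) and decoding yields realistic outputs.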

2.4 Controlling outputs

Diffusion models and other generative models like VAEs and GANs produce high-quality outputs from random inputs. Controlling these outputs is not straightforward and requires additional effort. Early techniques for controllable image synthesis generated outputs conditioned on class labels or disentangled dimensions of the latent space so that altering a dimension corresponds to a human-interpretable operation (Chen et al. 2016). However, these approaches offer very limited ways to customize outputs. Text-to-image models offer more versatility. They typically encode text and use it as input for generation. Text encoding may be done using a frozen encoder (Saharia et al. 2022) or by modifying the text encoder during the training process (GLIDE (Nichol et al. 2022)). Text-to-image models require (image, text) pairs for training. The textual encoder can be derived from the text of these pairs only or from a broader corpus and larger models. Text-to-image models can either start generation from a low-dimensional latent space, such as Dall-E (Ramesh et al. 2022; Betker et al. 2023), or construct images directly in pixel space, such as GLIDE (Nichol et al. 2022). To illustrate, we briefly discuss Dall-E (Ramesh et al. 2022; Betker et al. 2023). Dall-E uses a multimodal contrastive model (CLIP) in which image and text embeddings are matched. It also includes a text-to-image generator that bridges the gap between the CLIP text and image latent spaces, which can be learned using a diffusion prior.

2.5 LLM training

LLM training comprises at least one and up to three phases. These phases differ in training methodology, goals, and data requirements.

Self-supervised pre-training The raw model is trained on vast amounts of data using self-supervised techniques, such as next word prediction (GPT-2) or tasks like predicting masked words and sentence order (BERT). The goal is to learn a flexible and broad representation of text that can serve as a foundation for many different tasks. The data can include various types of text, as described for LLama-2 (Touvron et al. 2023), such as code, Wikipedia articles, or parts of the internet (Common Crawl Foundation 2024).

Instruction tuning This phase adapts a pre-trained LLM through supervised fine-tuning (Lou et al. 2023; Zhang et al. 2023c). The term "instruction" is often used interchangeably with prompt, but instructions tend to be more explicit and directive with precise guidance. The goal is to improve performance on common use cases for LLMs, such as following short instructions. The training data can include task instructions, input (task instance), and desired outputs. Instruction tuning enhances performance on both seen and unseen tasks and across different scales of architectures (Longpre et al. 2023). It can also encode additional domain-specific knowledge, such as medical information (Singhal et al. 2023).
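For illustration, an instruction-tuning example is often stored as a triple of instruction, optional input (task instance), and desired output; the field names below are illustrative rather than those of any specific dataset.

```python
instruction_example = {
    "instruction": "Summarize the following review in one sentence.",
    "input": "The battery lasts two days, but the screen scratches easily ...",
    "output": "Long battery life, but the screen is prone to scratches.",
}

# Supervised fine-tuning trains the LLM to produce `output` given the
# concatenation of `instruction` and `input`.
prompt = instruction_example["instruction"] + "\n" + instruction_example["input"]
target = instruction_example["output"]
```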

Alignment tuning LLMs might not incorporate human values and preferences, producing harmful, misleading, and biased outputs (Shen et al. 2023; Weidinger et al. 2022). Alignment criteria can be diverse, covering helpfulness, honesty, and harmlessness (Ouyang et al. 2022). Training data typically comes from humans, who rank LLM-generated answers or produce their own answers. Occasionally, more powerful and already aligned LLMs produce training data (Taori et al. 2023). Training can involve fine-tuning the LLM using supervised learning with the alignment dataset. Alternatively, human feedback can be used to learn a reward model that predicts human scoring of an output. This reward model can adjust the LLM using reinforcement learning, where the LLM generates outputs and receives feedback from the reward model.
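A common way to learn the reward model from human rankings is a pairwise preference loss that scores the preferred answer above the rejected one. The sketch below (PyTorch, with a small placeholder reward network and random embeddings standing in for encoded prompt-answer pairs) illustrates this idea, not any specific implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Placeholder reward model mapping a (prompt, answer) embedding to a scalar reward.
reward_model = nn.Sequential(nn.Linear(768, 128), nn.ReLU(), nn.Linear(128, 1))

def preference_loss(emb_chosen, emb_rejected):
    """Pairwise loss: the human-preferred answer should receive the higher reward."""
    r_chosen = reward_model(emb_chosen)
    r_rejected = reward_model(emb_rejected)
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Random embeddings stand in for a batch of 4 ranked answer pairs.
loss = preference_loss(torch.randn(4, 768), torch.randn(4, 768))
loss.backward()  # the trained reward model then provides feedback for RL fine-tuning
```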

3 The importance, challenges and desiderata of GenXAI

This section motivates the need for using eXplainable Artificial Intelligence (XAI) for GenAI (Table 3), discusses its challenges (Table 4), and describes the important desiderata that explanations should fulfill (Table 5).

3.1 Importance of XAI for GenAI

Need to adjust outputs The need for explainability in GenAI increases as humans control and tailor the generated outputs. Generative AI blurs the boundary between users and developers. It can be instructed to solve tasks using auxiliary knowledge (Lewis et al. 2020), as demonstrated by OpenAI’s GPT-4 store (OpenAI 2023a), which allows ordinary users without programming skills to build and offer applications leveraging uploaded knowledge and GPT-4. A new skill of prompt engineering has emerged, which users need to master (Zamfirescu-Pereira et al. 2023). Experts have identified XAI as a key requirement to support prompt engineering (Mishra et al. 2023). Users need to better understand how to control outputs, handle limitations, and mitigate risks (Weidinger et al. 2022). Therefore, stakeholders must understand GenAI to create solutions aligned with their preferences.

Need for output verification GenAI models, particularly LLMs, are known for many shortcomings, ranging from generating harmful, toxic responses to misleading, incorrect content (Weidinger et al. 2022). LLMs can be said to be "fluent but non-factual," making their verification even more challenging. Their outputs are generally untrustworthy and require some form of verification or validation (as acknowledged by regulators (European Union 2023)). Explanations provide a mechanism to identify errors, such as hallucinations, in outputs.

Increased reach GenAI is easy to access and use through a web interface, facilitating widespread adoption. ChatGPT was the fastest product to reach 100 million users and continues to grow rapidly (Porter 2023). It is used throughout society, including by vulnerable groups such as schoolchildren, the elderly with limited IT knowledge, and corporate employees.

High-impact applications GenAI is a general-purpose technology with applications that may have severe immediate and long-term impacts. Users might seek advice for pressing personal problems (Shahsavar and Choudhury 2023) or use GenAI for educational purposes. In an educational context, using ChatGPT once and receiving a slightly biased response (e.g., towards gender, race, or minority groups) might have limited impact, but receiving such responses over a prolonged period could profoundly affect future generations. Aside from this long-term view, users might turn to GenAI like ChatGPT while in psychological distress and seeking immediate advice. ChatGPT is even known to outperform humans in emotional awareness (Elyoseph et al. 2023). Given such high-stakes applications, understanding GenAI becomes crucial.

Unknown applications Since generative AI can process any text, image, or other medium as input and output, it is difficult to anticipate all possible applications. Therefore, understanding the models more holistically to ensure they align with higher-level principles becomes increasingly important.

Difficult to evaluate automatically Simple strategies, such as counting the number of correct and incorrect answers (accuracy), provide only a limited picture of GenAI’s behavior, as many tasks yield responses that are difficult to score. For example, in summarization, human ratings of gold-standard references can be worse than those of model outputs, highlighting the challenge of designing benchmarks (Sottana et al. 2023). Therefore, a thorough, systematic quantitative evaluation using classical input and output test datasets is difficult, increasing the demand for better model understanding to anticipate potential shortcomings, as automated tests are challenging and possibly insufficient.

Security and safety concerns Various safety and security problems exist with GenAI-based technologies, as seen with adversarial examples (Goodfellow et al. 2015). GenAI enables novel forms of attacks and abuse (Gupta et al. 2023), such as social engineering on a massive scale that targets large numbers of people, e.g., to manipulate elections. Additionally, individuals might leverage GenAI for malicious activities, such as obtaining detailed instructions for performing terrorist attacks.

Accountability and legal concerns The need for accountability arises from multiple questions, often driven by legal concerns. GenAI has a complex supply chain of data providers and developers, meaning data might come from many sources, and multiple companies might be involved in building a GenAI system. These include those who create a foundation model (Schneider et al. 2024c), those who fine-tune it for specific tasks, and ultimately, users who adjust it through prompting. This makes accountability a challenging but necessary task, as GenAI systems have the potential to cause harm. Thus, the question of who is responsible and what causes harm becomes more relevant. This can lead to forensic questions like “Why did an AI trigger a certain action (Schneider and Breitinger 2023)? Was it the training, data, or model?” Even if it can be tied to one aspect, such as data, further questions arise, like, "Was it data from a third-party provider, public data, or data from users?" The need for accountability also arises from the use of copyrighted or potentially patented material. In lawsuits, a key question is whether a patent is valid because it is highly original. If GenAI can explain that the solution is not "a copy from an existing patent" but emerges through basic reasoning given "prior art," the judge might be more inclined to rule the patent invalid.

Table 3 Why is explainability important for GenAI?
Table 4 Why is XAI more challenging for GenAI?

3.2 Why is XAI more challenging for GenAI?

Lack of access Commercial GenAI models from large corporations such as OpenAI and Google are among the most widely used systems. However, users, researchers, and other organizations interested in understanding the generative process of specific artifacts or models cannot access the model internals and training data of these models. This limitation rules out many XAI approaches.

Interactivity Some tasks involving humans and GenAI are inherently interactive, such as negotiations between GenAI and humans (Schneider et al. 2023b). Therefore, explanations need to focus not just on the model but also on how the model impacts humans and vice versa throughout the interaction.

Complex systems, models, data, and training Understanding AI becomes increasingly difficult as models grow and process more training data. GenAI, based on very large foundation models, constitutes the largest AI models today, with hundreds of billions of parameters (Schneider et al. 2024c). These models also lead to novel, more complex AI supply chains. Pre-GenAI models were often built using company-internal data, possibly by fine-tuning a model trained on a moderate-sized public dataset such as ImageNet. GenAI systems are built using much larger datasets from various sources, including public and third-party providers. Often, a foundation model is further adjusted through fine-tuning or by retrieving information from external data sources (Lewis et al. 2020) to create a GenAI system. Thus, the final output of a GenAI system might be generated not only through a deep learning model but also by engaging with other tools such as code interpreters (Schick et al. 2024).

Complex outputs Generated artifacts are complex, typically consisting of thousands to millions of bits in an information-theoretic sense. Classical supervised learning models often produce just a single bit for binary decisions and at most a few dozen bits for other classifications, i.e., mostly less than a few million classes. Before GenAI, a classifier’s label constituted a single decision, while GenAI makes a multitude of decisions: one for each aspect of the artifact. Thus, textual outputs and images naturally lead to investigations of several aspects of the output, such as tone, style, or semantics (Yin and Neubig 2022). There are many possible questions about why an artifact exhibits a certain property.

Hard to evaluate explanations It is difficult to evaluate explanations, especially for function-grounded approaches, where GenAI itself is hard to evaluate. Function-grounded evaluation focuses on assessing an XAI method based on predefined benchmarks, for example, using an explanation to classify an object and determining whether the explanation matches the output of the model used to create it. As such benchmarks might not exist and are difficult to create for certain GenAI tasks, evaluating explanations becomes more challenging.

Diverse users A model might be utilized by a diverse set of users across all age and knowledge groups, covering many needs, similar to a search engine. In contrast, pre-GenAI systems are more often tailored to a specific user group and task.

Risk of ethical violations Even commercial GenAI models are known for producing potentially offensive, harmful, and biased content (Achiam et al. 2023). GenAI models also commonly self-explain, which raises the possibility that while the outputs might be ethical, the explanations could be offensive due to inadequate phrasing or visual depictions.

Technical shortcomings GenAI suffers from hallucinations and limited reasoning capability. Consequently, if GenAI models self-explain, these explanations are subject to the same shortcomings.

Table 5 Overview of novel and emerging desiderata for GenXAI


3.3 Desiderata of explanations for GenAI

There are several novel and emerging desiderata, as well as significant shifts in the relevance of established ones:

  • Verifiability has recently been discussed as an important aspect of explanations (Fok and Weld 2023). However, the issue that outputs of LLMs cannot be verified (due to hallucinations) was discussed years earlier (Maynez et al. 2020). If explanations cannot be verified and an explainee must trust them, the consequence can be the rejection of a correct answer due to an incorrect explanation or the failure to detect incorrect outputs. A key concern regarding verification is the effort required to verify. Efforts to understand and tailor explanations have been discussed for XAI in general, for example, in terms of time efficiency (Schwalbe and Finzel 2023).

  • Lineage ensures that model decisions can be traced back to their origins. It involves tracking and documenting data, algorithms, and processes throughout the AI model’s lifecycle. It is crucial for accountability, transparency, reproducibility, and, more generally, governance of artificial intelligence (Schneider et al. 2023a). It concerns the "who" and the "what," such as "Who provided the data or made the model?" and "What data or aspects thereof caused a decision?" While the latter is a well-known aspect of XAI, evidenced by sample-based XAI techniques, the former has not been significantly emphasized in the context of XAI. Faubel et al. (2023) established data traceability as an XAI-related requirement for Machine Learning Operations (MLOps) in industrial applications. The need for lineage arises as GenAI supply chains become more complex, often involving multiple companies (Schneider et al. 2024c) rather than just a single one. Additionally, multiple lawsuits have been filed in the context of generative AI, for example, related to copyright issues (Grynbaum and Mac 2023). Regulators have also imposed stringent demands on AI providers (European Union 2023). Therefore, employing GenAI poses legal risks to organizations. Ensuring lineage-supported accountability can serve as risk mitigation.

  • Interactivity and personalization have been previously discussed in XAI (Schwalbe and Finzel 2023; Schneider and Handali 2019). However, as GenAI outputs and systems become more complex, the number of options for how and what to explain has increased drastically. While there is only one bit to explain in a binary classification system, it amounts to millions for a generative AI system that generates images. Thus, it is nearly impossible for a user to understand the reasons behind all possible details of an output. Depending on user preferences and the purpose of the explanation, some aspects might be more relevant to understand than others. This requires the user to engage with the system to obtain explanations and the system to provide adequate explanations tailored to the user’s demands. As discussed in Sect. 4.2.3, systems supporting interactive XAI are increasingly emerging.

  • Dynamic explanations aim to automatically choose the explanation qualities and content (e.g., what output properties are explained) based on the sample, explanation objective, and possibly other information. For example, an explanation might elaborate on why a positive or negative tone was chosen for a generated text. Tonality might be discussed for some samples but not for others. When to include it should align with a specified objective, e.g., the explanation should satisfy the explainee (plausibility), explain the most surprising aspects of the generated artifacts, or explain the most impactful properties of the artifact on its consumers. The goal of dynamic explanations is to maximize the specified objectives while being free to choose the explanation content, including meta-characteristics such as what aspects are included in the explanation, its structure, etc.

  • Costs related to XAI might also become an emerging concern. The economics of XAI have been elaborated in (Beaudouin et al. 2020), where concerns about the costs of implementing XAI (and transparency) are mentioned. For corporations, GenXAI adds the risk of leaking value. Competitors (or academics) might prompt a model to generate training data for their own models, saving the costs of creating such data. For example, the Alpaca model (Taori et al. 2023) was trained on data extracted from one of OpenAI’s models, allowing the avoidance of the costly and time-consuming task of collecting data from humans. Using explanations, such as part of chain-of-thought (CoT) (Wei et al. 2022), can further enhance performance and thus add value.

  • Alignment criteria, such as helpfulness, honesty, and harmlessness (Askell et al. 2021), which are relevant for GenAI in general, also play a role for XAI. Some of these criteria partially overlap with existing ones, such as plausibility (with helpfulness) and faithfulness (with honesty). For instance, explanations should not be deceptive (Schneider et al. 2023c). Aspects such as harmlessness have received less attention, as explanations were commonly simpler, such as attribution-based explanations. Harmlessness implies that explanations do not contain offensive information or information about potentially dangerous activities (Askell et al. 2021). It also relates to security, discussed next:

  • Security is increasingly evolving as a desideratum. Explanations should not jeopardize the security of the user or the organizations operating the GenAI model. Providing insights into the reasoning process might facilitate attacks or be leveraged in competitive situations, leading to poorer outcomes for GenAI. For example, in a recent study on price negotiations between humans and LLMs (Schneider et al. 2023b), humans asked an LLM what decision criteria it used and then systematically exploited this knowledge to achieve better outcomes against the LLM. A human negotiator might not disclose such information. Thus, openness can be abused. “Security through obscurity” is one protection mechanism against attacks. Corporations might also aim to protect their intellectual property. For example, customer support employees (and GenAI models) might have access to some relevant information about a product to help customers but might not be allowed to obtain explanations on how the product is manufactured, as this might constitute a valuable company secret.

  • Uncertainty Understanding the confidence of outputs is an important aspect of XAI. While most works aim to explain a decision, explaining the uncertainty of a prediction has also garnered attention (Molnar 2020; Gawlikowski et al. 2023). Deep learning models, such as image classifiers, are known to be overconfident, and multiple attempts have been made to address this issue (Meronen et al. 2024; Gawlikowski et al. 2023). LLMs arguably take this a step further, as they often generate answers eloquently even if they are wrong, i.e., they are "fluent but non-factual." Still, LLMs have some (though not perfect) understanding of whether they can answer a question (Kadavath et al. 2022). They can also be enhanced with uncertainty estimation techniques (Huang et al. 2023c).

Prior research has extensively discussed the principles and desiderata of explanations. Here, we briefly summarize the key characteristics explanations should exhibit, drawing on prior surveys (Lyu et al. 2024; Schneider and Handali 2019; Schwalbe and Finzel 2023; Bodria et al. 2023; Guidotti 2022), and discuss these characteristics in the context of GenAI. Among the most important and well-known desiderata are:

  • Faithfulness (= fidelity = reliability) An explanation should accurately reflect the reasoning process of a model. Due to the complexity of GenAI models, a higher level of abstraction seems necessary to keep explanations comprehensible within a limited amount of time.

  • Plausibility (= persuasiveness = understandability) An explanation should be understandable and compelling to the target audience. Textual explanations, especially self-explanations, are often easier to understand compared to classical explanations such as SHAP values.

  • Completeness (= coverage) and minimality: An explanation should contain all relevant factors for a prediction (completeness) but no more (minimality). Complete coverage becomes less feasible with the growth of output, data, and model sizes. Personalized, interactive explanations that allow users to control explanations and XAI techniques that automatically select only interesting properties to be explained could be the way forward.

  • Complexity The total amount of conveyed information in an explanation, typically measured relative to the explainee’s knowledge, i.e., subjectively. This aspect gains importance as stakeholders become more diverse.

  • Input and model sensitivity and robustness Changes in the input (or model) that impact model outputs should also lead to changes in explanations (sensitivity). However, if changes in inputs do not alter model behavior, or changes in the model do not significantly alter processing and outputs, then explanations should not change disproportionately (robustness). This still holds for GenAI.

4 Taxonomy of XAI techniques for GenAI

Our taxonomy provides a scheme for classifying XAI mechanisms and algorithms that support the understanding of GenAI (Sect. 4.1), which we use to classify existing techniques (Sect. 4.4).

4.1 Dimensions of taxonomy

The key characteristics of our taxonomy are summarized in Table 6. We distinguish between the output (i.e., explanation) properties, input, and internal properties of GenXAI algorithms. Explanation properties characterize the outputs of XAI algorithms, i.e., the explanations, in terms of scope (what fraction of samples, attributes, and parts of the interaction they explain), modality (unimodal or multi-modal), and interactivity (whether the user can engage in obtaining additional explanations or tailor explanations). Input and internal properties relate to what the XAI algorithms require to produce explanations and how these explanations are obtained. While many ideas on structuring XAI are still valid in the context of GenAI, we focus primarily on novel dimensions such as the foundational source for XAI, which can be data, model, training, or prompt. The source forms the key mechanism or artifact leveraged by XAI techniques to generate explanations, as elaborated in Sect. 4.3.1. One might classify the first three sources (data, model, and training) under the category of intrinsic methods, as they impact the resulting model. However, data can also be extrinsic, particularly in Retrieval-Augmented Generation (RAG). Similarly, for prompts, Chain-of-Thought (CoT) encourages and guides XAI but also relies to some extent on training. If training data lacks any form of explanation, CoT prompting will not work. This is most evident in extreme cases where words like "because" are removed from the training data, implying that they will never be generated.

Table 6 Dimensions of our taxonomy for GenXAI algorithms

4.2 Explanation properties

We elaborate on the dimensions related to outputs of XAI methods, i.e., explanations.

4.2.1 Scope

We discuss scope in terms of output, interaction and input scope. Often, scope (Schwalbe and Finzel 2023) only refers to what inputs are explained by a method, i.e., a single sample (local) versus all samples (global), meaning the model behavior in general (Guidotti et al. 2019; Bodria et al. 2023). Thus, the scope indicates the quantity of the input samples explained, which we call input scope. For GenAI, scope also refers to the quantity of the output that is explained, i.e., a single attribute of the output (focused) versus all attributes (holistic). We call this output scope. This is illustrated with an example in Fig. 4. As outputs are significantly more complex for GenAI, they present more options for questions. For example, why did the response contain certain information and not other information? Why was the sentiment of a generated sentence positive, neutral, or negative? While some of our methods touch on these questions, there is limited work overall in this direction. Similarly, there is limited understanding of how to relate training data to predictions beyond classical approaches such as influence functions, i.e., how a specific piece of knowledge in the training data impacted the generation process.

Fig. 4 Illustration of input and output scope, where we use “tonality” as an example of a text-related attribute. Traditionally, scope referred only to input scope

Interaction scope: For GenAI, interactions commonly take the form of dialogues, making it more relevant to understand both sides of the interaction. Therefore, we distinguish two goals:

  • Explaining (single) input–output relations: This is the classical notion, where an AI system processes one input to produce one output. The goal can be to understand a particular instance (local explanation) or the model as a whole. While explanations might consider user-specific aspects and personalize explanations (Schneider and Handali 2019), such personalization is typically independent of the interaction.

  • Explaining the entire interaction: This focuses on human-AI interaction and its dynamics, for example, the communication and actions between an AI and a human when the human solves a task using the AI. Interactions are characterized by multiple rounds of outputs generated by both sides conditioned on prior outputs. In this case, the goal is to more holistically explain (i) the dynamics of the interaction, i.e., not just a single input–output pair but the entire sequence of in- and outputs, and (ii) the outcome of the interaction, which could be why a particular artifact such as an image was generated in a certain way, but also why a user did not complete a task, for example, abandoned it prematurely or achieved an unsatisfactory output. Interaction dynamics are influenced by a series of technical factors (such as model behavior, including classical performance measures but also latency, the user interface, etc.) and non-technical factors (such as human attitudes and policies). As such, human-AI interaction cannot easily be associated with one scientific field but is inherently interdisciplinary. Explainability, which aims at understanding AI technology, should focus on how technical factors related to model behavior impact the interaction. While many existing works touch on the subject, the change in interactivity brought along by prompting due to GenAI is not well understood. One study investigated interactivity in negotiations (Schneider et al. 2023b); however, the explanations for negotiation outcomes and interaction behavior were obtained not through algorithms but through a manual, qualitative investigation of the interaction, a common technique in the social sciences but less prevalent in computer science.

The two goals are illustrated with an example in Fig. 5.

Fig. 5 Illustration of input and interaction scope, where we use “tonality” as an example of a text-related attribute. Traditionally, scope referred only to input scope

The field of human-computer interaction has aimed at explaining human interactions for a significant amount of time (MacKenzie 2024; Carroll and Olson 1988), with some effort also devoted to discussing human-AI interaction. For example, guidelines for human-AI interaction (before GenAI) are well-studied (Amershi et al. 2019), as is the topic of explanation in collaborative human-AI systems (Theis et al. 2023). Furthermore, explanations in human-AI systems typically encompass objectives driven by non-technical concerns (Miller 2019; Mueller et al. 2021). However, there is less work on explaining human-AI interactions themselves (Sreedharan et al. 2022), particularly targeted towards GenAI. To provide two examples from the pre-GenAI era spanning from in-depth technical studies to broader organizational studies: Schneider (2022) discussed how human-AI interaction could be optimized, accounting for long-term goals such as preserving human diversity. The work explains how the user can improve their interaction to reduce error rates and become more efficient. Grisold and Schneider (2023) investigated the dynamics of AI adoption within an organization, explaining how error rates and the learning behavior of an AI impact the complexity of processes. Explanations in interactive systems in pre-GenAI (Rago et al. 2021) and GenAI (see Sect. 4.2.3) are typically more concerned with supporting interactive explorations of a decision, for instance, querying a system to better understand it, rather than obtaining explanations on how a sequence of inputs and outputs emerged.

4.2.2 Explanation modality

Commonly, explanations are unimodal, such as textual, visual, or numeric. However, multi-modal explanations have also been investigated (Park et al. 2018) by collecting a dataset containing textual and visual justifications. This research shows that multi-modal explanations yield favorable outcomes, with each modality improving due to the presence of the other.

4.2.3 Dynamics

Interactivity In a classical setting, XAI is often non-interactive, meaning the explainee has limited options to control explanations or request additional information. However, the idea of interactivity in XAI has existed for some time, as shown in prior taxonomies (Schwalbe and Finzel 2023). One idea is to reconceptualize XAI as a dialogue (Singh et al. 2024). Slack et al. (2023) supports the explainability of a machine learning model by using an LLM to translate natural language queries into a predefined set of operations related to explainability, such as identifying the most important features or changing a feature. Gao et al. (2023) uses LLMs to create an interactive and explainable recommender system. In terms of classical metrics such as recall and precision, the system does not outperform existing approaches. Improving models based on human explanations in an interactive setting has also been studied for image recognition (Schramowski et al. 2020). While LLMs are known for their superior performance in many Natural Language Processing (NLP) tasks, they might also be employed without improving classical metrics such as accuracy, precision, or recall, but rather to provide other benefits, such as explainable, interactive systems requiring less training data (Gao et al. 2023; Deldjoo 2023). Deldjoo (2023) showed that, through explanation-guided prompts, binary risk classification can roughly match the performance of a classical machine learning system with 40x less data.

Static vs. dynamic explanation qualities and content relates to whether the structure and content of an explanation vary with the sample. This includes which properties of the generated artifact are explained and how they are explained. This choice can be independent of the sample (static), i.e., identical for each sample, or dependent on the sample itself (dynamic). For example, classical methods like attribution maps are static, as they only explain relevance and always assign a relevance value to each input pixel. In contrast, textual self-explanations can be dynamic. The explained properties can be selective, e.g., for a generated image, an explanation like "A moving, red car was shown to make the image more vivid and engage the viewer" only describes one part of the image and one property of that part. Compare this to an explanation for a generated image of a mountain landscape: "A snowy landscape was chosen as the prompt contained the word ’bright.’" The two exemplary explanations refer to different objects (car vs. landscape), differ in the explained properties (explaining emotional intention vs. not discussing it), and differ in the causal factors discussed (mentioning which inputs caused output properties, i.e., "bright" in the prompt, vs. not doing so); such variation with the sample is what makes explanations dynamic.

4.3 Input and internal properties

We elaborate on the dimensions related to the inputs and internals of XAI methods (in contrast to their outputs).

4.3.1 Foundational sources for XAI techniques

We consider the data, model, optimization (training), and prompt as the foundational sources for XAI methods. That is, each source can be modified or tailored to improve XAI.

Model-induced XAI refers to intrinsic XAI methods (also known as model-specific XAI (Guidotti et al. 2019)), where the design of the model is altered to foster explainability. With GenAI, novel intrinsic methods have emerged.

Using interpretable components Deep learning models comprise multiple layers and components, such as activation functions and attention mechanisms. These components can be more or less complex, affecting the overall explainability of the model. For example, interpretable activation functions might replace conventional, less interpretable functions. SoLU (Elhage et al. 2022) is said to enhance interpretability. It is motivated by the idea that a layer can represent more features than it has neurons, encoding them in superposition (Olah et al. 2020), and the paper provides evidence supporting this hypothesis. SoLU makes some neurons more understandable but hides others; thus, while the method claims to provide a net benefit overall, it is not without costs. Classical attention layers are also commonly used, though their value for XAI is debated (see Sect. 4.4.1).
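As a minimal sketch (in NumPy), SoLU can be written as the element-wise product of the pre-activations with their softmax, i.e., x · softmax(x); the LayerNorm applied after SoLU in the original setup is omitted here for brevity.

```python
import numpy as np

def solu(x: np.ndarray) -> np.ndarray:
    """SoLU activation: x * softmax(x) over the feature axis (Elhage et al. 2022).

    The softmax factor amplifies the largest pre-activations and suppresses the
    rest, which is argued to make individual neurons easier to interpret.
    """
    e = np.exp(x - x.max(axis=-1, keepdims=True))  # numerically stable softmax
    return x * (e / e.sum(axis=-1, keepdims=True))

# A strongly activated feature dominates; weaker ones are pushed towards zero.
print(solu(np.array([4.0, 1.0, 0.5, -2.0])))
```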

Interpretable GenAI models through additional models Model interpretability can be achieved by combining LLMs with other models and often with external knowledge. For example, in (Chen et al. 2023c), an LLM is enhanced with a GNN and external knowledge to generate an explanation and prediction jointly. Creswell and Shanahan (2022) fostered explicit multistep reasoning by chaining responses of two fine-tuned LLMs; one performs selection, the other inference. Wang and Shu (2023) incorporates external knowledge using classical internet search (as done in commercial products such as Bing Chat). Additionally, it uses first-order logic to create easier-to-verify subclaims that jointly lead to the overall claim. It also generates explanations by querying the LLM. Vedula et al. (2023) developed and trained a decoder for faithful and explainable online shopping product comparisons.

Optimization Adjusting the optimization objective is a common technique to foster explainability. Common strategies involve disentangling latent dimensions (Ross et al. 2017) and training for XAI-relevant criteria, such as citing evidence (Menick et al. 2022). In the context of GANs, Chen et al. (2016) explicitly trained disentangled representations by maximizing the mutual information between a small subset of the latent variables and the sample. This approach allowed the so-called InfoGAN to separate writing styles from digit shapes on the MNIST dataset. Furthermore, in (Ross et al. 2017), explanations have been used to constrain training, aiming "to be right for the right reasons." Note that while disentanglement is commonly achieved through training objectives and regularization, special network architectures like capsule networks (Sabour et al. 2017) and others (see the section on Disentanglement in (Schwalbe and Finzel 2023)) can also support disentanglement. Diffusion models already have a semantic latent space (Kwon et al. 2022). A special reverse process can leverage this for image editing using CLIP (Ramesh et al. 2022), which iteratively improves reconstructed images. (Menick et al. 2022) trains multiple models to cite evidence for claims, facilitating answer understanding and, especially, fact-checking.
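Concretely, InfoGAN augments the standard GAN minimax objective \(V(D,G)\) with a variational lower bound \(L_I(G,Q)\) on the mutual information \(I(c;G(z,c))\) between a subset of latent codes c and the generated sample, optimizing \(\min _{G,Q}\max _D V(D,G)-\lambda L_I(G,Q)\), where the auxiliary network Q approximates the posterior over c and the hyperparameter \(\lambda \) balances sample realism against disentanglement.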

Training data can support XAI in multiple ways. While the exact impact of training data and dynamics is not yet fully explored (Teehan et al. 2022), current findings indicate that training data is a crucial factor in supporting explainability, particularly for textual data.

Training data composition: GenAI training data is usually complex, even for a single modality. For example, text data might consist of programming code, books, dialogues, etc. The composition of the training data can impact reasoning, e.g., code can improve certain reasoning tasks (Ma et al. 2023). While less is known explicitly about the impact on explanations, the overall data composition is also likely to impact XAI.

Explanation quality and quantity Aside from the overall composition of training data, the presence and absence of explanations in the training data also impact XAI. While it is well-known that the quality of training data strongly impacts a model’s performance, the impact on XAI has received less attention. However, as GenAI can self-explain, i.e., generate explanations as part of its outputs, as seen with CoT (Wei et al. 2022), the training data also strongly impacts the quality of these explanations. For illustration, assume that a model is trained with erroneous explanations but potentially correct results; it could perform well on tasks but provide poor explanations. Additionally, if the training data does not contain any data explaining the reasoning, the model might perform worse at explanations. In an extreme case, if the training data does not contain explanations, including explanatory words such as "due to," "for that reason," and "because," the model will also never generate such words (and the corresponding explanations).

Domain specialization (Ling et al. 2023) can tailor the model more toward a specific domain, potentially at the cost of abilities in other domains (Chen et al. 2020). It can also contribute towards explainability, as a narrower model focus implies fewer potential options for explanations and, thus, a lower risk of errors and easier verifiability.

Prompts can also induce explanations: LLMs can be prompted (i) to provide explanations in a preferred manner or (ii) be constrained to rely on a limited set of given facts to create responses, facilitating verification. The idea of "rationalization" dates back to the early 2000s (Zaidan et al. 2007). One type of explainability is “justifying a model’s output by providing a natural language explanation” (Gurrapu et al. 2023). Commonly, explanations clarifying the reasoning process are elicited through chain-of-thought (CoT) prompting (Wei et al. 2022). Such prompting allows structuring the reasoning process, implicitly shaping explanations, and utilizing external knowledge for each reasoning step to yield more faithful explanations (Wei et al. 2022; He et al. 2022). However, explanations can also contain hallucinations, as shown in the context of few-shot prompting (Ye and Durrett 2022), and CoT prompting is sensitive to inputs (Turpin et al. 2024). Despite their unreliability (due to a lack of faithfulness in explaining the true inner workings), such explanations might still be valuable for verifying output correctness or in training (smaller) models (Zhou et al. 2023). Additionally, LLMs can be constrained to rely on information from the user-given prompt or extracted from an external database or the internet, a process known as Retrieval-Augmented Generation (RAG) (Lewis et al. 2020). RAG facilitates understanding and verifying the response of an LLM, as the source of information for the answer is known and typically much smaller than the entire training data of a GenAI model. To facilitate explainability for recommender systems, personalized prompt learning, such as soft-prompt tuning to yield vector IDs, has been conducted (Li et al. 2023a).
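As a minimal sketch of the RAG pattern (the embedding function and document store below are illustrative placeholders, not a particular library's API), the retrieved passages both ground the answer and serve as evidence the user can inspect to verify it:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder encoder producing a vector embedding (illustrative stub)."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(8)

documents = ["GenXAI denotes explainable AI for generative models.",
             "Diffusion models generate images by iteratively denoising."]
doc_vectors = np.stack([embed(d) for d in documents])

def rag_prompt(question: str, k: int = 1) -> str:
    """Retrieve the k most similar documents and build a grounded prompt."""
    q = embed(question)
    sims = doc_vectors @ q / (np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q))
    top = [documents[i] for i in np.argsort(-sims)[:k]]
    sources = "\n".join(f"[{i + 1}] {d}" for i, d in enumerate(top))
    return ("Answer based only on the following sources and cite them:\n"
            f"{sources}\n\nQuestion: {question}")

print(rag_prompt("What is GenXAI?"))
```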

4.3.2 Required model access by XAI method

Depending on what is to be explained and the XAI method used, different information is needed to obtain explanations (Schwalbe and Finzel 2023). Black-box access provides no information beyond the predicted output, and even output probabilities might not be available; internal information, such as activations and gradients, is commonly unavailable. White-box access refers to having complete access to the model, its training data, and its training procedure. In between, there is a wide range of grey-box access (Schneider and Breitinger 2023). An important restriction for XAI techniques investigating commercial models is that these models are typically only accessible as a black box, such as GPT-4 through an API. Similarly, commercial vendors do not share their training data and often do not even disclose a basic summary of the training dataset. Thus, XAI techniques leveraging training data are not easily employable.

4.3.3 Model (self-)explainers

GenAI models, particularly LLMs, can provide explanations for their own decisions (self-explain) or serve as explainers in general. That is, the model itself provides explanations rather than relying on a dedicated XAI technique. This contrasts with the classical notion of intrinsic XAI, which often denotes understandable, simple models such as decision trees and linear regression. While self-explanations are not necessarily faithful (Turpin et al. 2024), attempts have been made to improve their accuracy. For example, Chuang et al. (2024) proposed using an evaluator to quantify faithfulness and iteratively optimizing the faithfulness scores. Though self-explanations are not necessarily accurate or faithful, they can be helpful, as demonstrated in a complex environment where agents performed multistep planning and improved through self-explanation (Wang et al. 2023). LLMs can serve as explainers by providing explanations for essentially anything. For instance, they can self-explain by generating explanations tailored to their outputs (Wei et al. 2022; Turpin et al. 2024), support explaining other machine learning models (OpenAI 2023b; Slack et al. 2023; Singh et al. 2023), provide explanations by analyzing patterns in data through autoprompting (Singh et al. 2022), support people in self-diagnosis (Shahsavar and Choudhury 2023), or yield interpretable autonomous driving systems (Mao et al. 2023). In fact, on free-form coding tasks (generation), LLMs produce explanations that often exceed the quality of crowd workers’ gold references (Ziems et al. 2023). For mental health analysis, the quality of LLM explanations approaches that of human explanations (Yang et al. 2023a).

4.3.4 Explanation sample difficulty

Not all input samples are equally challenging to explain (Saha et al. 2022). The idea that some samples and interactions are easier to explain than others gains relevance as the variance in possible inputs and outputs increases. For example, LLMs allow users to pose anything from the simplest to the most complex questions, and LLM-generated texts might range from a "lookup" of a fact learned from the training data to long stories or solutions to complex tasks. Explanations (as judged by humans) might be poorer for difficult samples than for simple ones. More precisely, this has been observed when explaining data labels with GPT-3, where GPT-3 explanations degraded much more with example difficulty than human explanations did (Saha et al. 2022). Generally, the idea of distinguishing XAI tasks based on their difficulty has received little attention so far. From a computational perspective, current explanation algorithms require the same amount of computation regardless of difficulty. Forward computations, too, typically proceed identically regardless of sample difficulty, though the notion of reflection (and thinking fast and slow) has been discussed in the literature (Schneider and Vlachos 2023b). LLMs can benefit from ensemble methods, such as combining multiple outputs into a single output (Huang et al. 2023b). There are numerous works on self-correction (Pan et al. 2023). For example, self-debugging using self-generated explanations as feedback has been linked to reduced coding errors by LLMs (Chen et al. 2023b); a minimal sketch follows below. However, the ability of LLMs to self-correct reasoning by "reflecting" on their responses without additional information has also been questioned (Huang et al. 2023a). While state-of-the-art systems like GPT-4 improve scores on causal reasoning benchmarks, which can serve as explanations, they also exhibit unpredictable failure modes (Kıcıman et al. 2023).
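
A minimal sketch of such a self-correction loop in the spirit of self-debugging is shown below; `generate` and `run_tests` are hypothetical helpers standing in for an LLM call and a test harness, respectively, and the prompts are illustrative.

```python
# Sketch of a self-correction loop in the spirit of self-debugging
# (Chen et al. 2023b); `generate` and `run_tests` are hypothetical helpers.
from typing import Optional

def generate(prompt: str) -> str:
    """Placeholder for any LLM completion call."""
    raise NotImplementedError

def run_tests(code: str) -> Optional[str]:
    """Return an error message if the candidate code fails its tests, else None."""
    raise NotImplementedError

def self_debug(task: str, max_rounds: int = 3) -> str:
    code = generate(f"Write Python code for the following task:\n{task}")
    for _ in range(max_rounds):
        error = run_tests(code)
        if error is None:
            return code
        # The model explains its own code and the observed failure ...
        explanation = generate(
            "Explain what this code does and why it produces the error below.\n"
            f"Code:\n{code}\nError:\n{error}"
        )
        # ... and the explanation is fed back as the signal for revision.
        code = generate(
            "Revise the code using this explanation as feedback.\n"
            f"Explanation:\n{explanation}\nOriginal code:\n{code}"
        )
    return code
```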

4.3.5 Dimensions of Pre-GenAI

There is also a long list of concepts relevant to our taxonomy that originate from pre-GenAI (Adadi and Berrada 2018; Zini and Awad 2022; Dwivedi et al. 2023; Schwalbe and Finzel 2023; Räuker et al. 2023; Saeed and Omlin 2023; Speith 2022; Minh et al. 2022), which we will not elaborate on in detail. For example, XAI methods can be classified according to what is to be understood, which includes the system, model-related information such as representations (layers, vectors, embeddings) and outputs, training dynamics, and the impact of data. This distinction was already made before GenAI (Schwalbe and Finzel 2023). The most common ways to structure XAI techniques in prior works are, unfortunately, not conceptually clean. For example, a common distinction is between mechanistic and feature attribution-based techniques. However, conceptually, feature attribution relates to how the explanation looks (i.e., relevance scores for output). Meanwhile, mechanistic interpretability aims more at what is being investigated (i.e., neurons and interactions) and how the techniques work (through reverse engineering). As the names of the existing categories are well-established and thus easy for readers to comprehend, we shall use them in our classification of techniques shown in the next section while also discussing classification based on our novel dimensions shown in Table 6.

4.4 Classification of techniques

We categorize XAI techniques into four groups commonly found in existing literature (see prior surveys in Sect. 1). Figure 6 provides an overview of these techniques. Our focus is on techniques developed specifically for GenAI models or classical techniques that have been adjusted for GenAI (mostly by addressing computational issues) or could be employed without significant changes. We also structure existing techniques in terms of our novel dimensions (Sect. 4.4.5).

Fig. 6 Overview of categories of techniques with illustrative examples (Figures are from cited references)

4.4.1 Feature attribution

Feature attribution assigns a relevance score to each input feature, such as a word or pixel.

Perturbation-based techniques partially alter inputs, for example by removing or changing features, and investigate the resulting output changes. In the context of NLP, alterations such as removing tokens (Wu et al. 2020) as well as negating and intensifying statements (Li et al. 2016) have been explored.
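
As an illustration, a minimal occlusion-style attribution might look as follows: each word is removed in turn and the drop in the probability of the originally predicted class is recorded. The sentiment model named here is only an example; any classifier could be substituted.

```python
# Occlusion-style attribution sketch: remove each word in turn and record the
# drop in the probability of the originally predicted class (model illustrative).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = "distilbert-base-uncased-finetuned-sst-2-english"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name).eval()

def probs(text: str) -> torch.Tensor:
    with torch.no_grad():
        return torch.softmax(model(**tok(text, return_tensors="pt")).logits[0], dim=-1)

text = "The movie was surprisingly good"
words = text.split()
base = probs(text)
pred = base.argmax().item()

# Relevance of a word = probability drop for the predicted class when it is removed.
scores = {
    w: (base[pred] - probs(" ".join(words[:i] + words[i + 1:]))[pred]).item()
    for i, w in enumerate(words)
}
print(scores)
```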

Gradient-based methods require a backward pass from outputs to inputs to obtain derivatives. While not all gradient-based techniques work reliably (Adebayo et al. 2018; Ghorbani et al. 2019), some, like Grad-CAM (Selvaraju et al. 2017), which computes a function of gradients and activations, have proven valuable at the pixel level in images and for token-level attribution (Mohebbi et al. 2021). Directional gradients have also been used in NLP models (Sikdar et al. 2021; Enguehard 2023). Integrated gradients have been used to attribute knowledge to internal neurons (Lundstrom et al. 2022; Dai et al. 2022), as discussed under "Neuron activation explanation." Simple first derivatives with respect to embedding dimensions are used in Li et al. (2016).
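
A simple gradient-times-input sketch on token embeddings, a basic relative of first-derivative saliency, could look as follows; the model name is illustrative, and this is not a faithful reimplementation of any cited method.

```python
# Gradient-times-input sketch on token embeddings (model name illustrative).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = "distilbert-base-uncased-finetuned-sst-2-english"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name).eval()

enc = tok("The movie was surprisingly good", return_tensors="pt")
embeds = model.get_input_embeddings()(enc["input_ids"]).detach().requires_grad_(True)

logits = model(inputs_embeds=embeds, attention_mask=enc["attention_mask"]).logits[0]
pred = logits.argmax().item()
logits[pred].backward()  # backward pass from the predicted class to the embeddings

# Per-token relevance: gradient times embedding, summed over embedding dimensions.
relevance = (embeds.grad * embeds).sum(-1).squeeze(0)
for token, score in zip(tok.convert_ids_to_tokens(enc["input_ids"][0]), relevance):
    print(f"{token:>15s} {score.item():+.3f}")
```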

Surrogate models approximate large models using much simpler models, often to understand individual predictions. Classical methods include LIME (Ribeiro et al. 2016) and SHAP (Lundberg and Lee 2017), which have been adapted for transformers (see Kokalj et al. 2021 for SHAP). Furthermore, attention flows in NLP models have been shown to relate to SHAP values (Ethayarajh and Jurafsky 2021). Explain Any Concept (EAC) (Sun et al. 2024) presents an approach for concept explanation, utilizing the Segment Anything Model (SAM) (Kirillov et al. 2023) for initial segmentation and introducing a surrogate model to make the explanation process more efficient. SAM excels at producing object masks from input prompts such as partial masks, points, or boxes, can generate masks for all objects in an image, and was trained on a vast dataset of 11 million images and 1.1 billion masks.
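
A LIME-style surrogate can be sketched in a few lines: random word-masking perturbations are generated, a black-box predictor scores them, and a weighted linear model is fitted as the local surrogate. The `predict_proba` callable is an assumed interface to whatever black-box model is being explained, not part of any cited library.

```python
# LIME-style surrogate sketch: fit a linear model on random word-masking
# perturbations of one input. `predict_proba(texts)` is an assumed interface
# returning an array of shape (n_texts, n_classes).
import numpy as np
from sklearn.linear_model import Ridge

def lime_like(text, predict_proba, label, n_samples=500, seed=0):
    rng = np.random.default_rng(seed)
    words = text.split()
    masks = rng.integers(0, 2, size=(n_samples, len(words)))
    masks[0] = 1  # keep the unperturbed sample
    texts = [" ".join(w for w, keep in zip(words, m) if keep) or words[0] for m in masks]
    probs = predict_proba(texts)[:, label]
    # Weight perturbed samples by similarity to the original (fraction of words kept).
    weights = masks.mean(axis=1)
    surrogate = Ridge(alpha=1.0).fit(masks, probs, sample_weight=weights)
    # The surrogate's coefficients serve as word-level relevance scores.
    return dict(zip(words, surrogate.coef_))
```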

Decomposition-based methods Decomposition traditionally refers to attributing relevance from outputs to inputs or decomposing vectors, but in the context of GenAI, it can also refer to decomposing the reasoning process and attributing outputs to specific reasons. Liu et al. (2022) aim to explain the reasoning process for question answering using entailment trees constructed through reinforcement learning. An entailment tree has a hypothesis as its root, reasoning steps as intermediate nodes, and facts as leaves. Common decomposition techniques compute relevance scores layer by layer, so that contributions of upper layers emerge as a combination of lower-level contributions. A classic example is Layer-wise relevance propagation (LRP) (Montavon et al. 2019). Decomposition-based methods have also been applied to transformers (Ali et al. 2022). Ali et al. (2022) claimed that their adaptation of LRP mitigates shortcomings of gradient methods, which are said to arise due to layernorm and attention layers. Linear decomposition has also been suggested for local interpretation of transformers (Yang et al. 2023c), where a decomposition is considered interpretable if it is orthogonal and linear. Decompositions are often vector-based (Luo and Specia 2024); they express a vector (such as a token embedding) in terms of more elementary vectors. For example, Modarressi et al. (2023) decomposed token vectors and propagated them through the network while maintaining accurate attribution. Zini and Awad (2022) surveyed XAI methods for word embeddings. One idea is to express embedding vectors in terms of an orthogonal basis of interpretable concept vectors; another technique employs external knowledge and sparsification of dense vectors.
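
For intuition, the epsilon rule of LRP for a single linear layer can be sketched as follows; stacking this redistribution step layer by layer yields the full decomposition. This is a didactic sketch, not a drop-in replacement for the cited transformer adaptations.

```python
# Epsilon-rule LRP sketch for one linear layer, the building block that
# layer-by-layer decomposition methods such as LRP stack through a network.
import torch

def lrp_epsilon_linear(layer: torch.nn.Linear, a: torch.Tensor,
                       relevance_out: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Redistribute relevance from the layer's outputs to its inputs.

    a: input activations, shape (n_in,); relevance_out: shape (n_out,).
    Returns relevance over the inputs, shape (n_in,); relevance is conserved
    up to the epsilon stabilizer (bias handling is ignored in this sketch).
    """
    z = layer(a)                          # pre-activations, shape (n_out,)
    z = z + eps * torch.sign(z)           # stabilizer to avoid division by zero
    s = relevance_out / z                 # shape (n_out,)
    c = layer.weight.t() @ s              # back-project through the weights, (n_in,)
    return a * c

# Toy usage: relevance of 4 inputs for a single output neuron.
layer = torch.nn.Linear(4, 1)
a = torch.randn(4)
r_in = lrp_epsilon_linear(layer, a, relevance_out=layer(a).detach())
```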

Attention-based Attention is a key element within neural networks, providing importance scores for inputs, which are not necessarily the initial inputs to the network but those of a prior layer. For LLMs, attention scores are commonly obtained between all input token pairs for a single attention layer and can be visualized using a heatmap or bipartite graph (Vig 2019). Relevance scores have also been computed by combining attention information with gradients (Barkan et al. 2021). Attention-based methods have been scrutinized because they might not identify the most relevant features for predictions (Serrano and Smith 2019; Jain and Wallace 2019); however, the debate remains unsettled. Stremmel et al. (2022) focus on explaining language models for long texts by leveraging sparse attention and developing a masked sampling procedure to identify text blocks contributing to a prediction. Some of the listed techniques leverage several ideas; for instance, Modarressi et al. (2023) can be considered both a vector-based and a decomposition-based method.
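
Extracting such attention maps is straightforward with standard toolkits, as the following sketch shows (model name illustrative); the resulting per-head matrices over all token pairs are what heatmap and bipartite-graph visualizers render.

```python
# Sketch: extract per-head attention maps over all token pairs in one layer
# (model name illustrative).
import torch
from transformers import AutoTokenizer, AutoModel

name = "bert-base-uncased"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name, output_attentions=True).eval()

enc = tok("Attention provides importance scores", return_tensors="pt")
with torch.no_grad():
    out = model(**enc)

# out.attentions: tuple with one entry per layer, each (batch, heads, tokens, tokens).
layer, head = 0, 0
attn = out.attentions[layer][0, head]
tokens = tok.convert_ids_to_tokens(enc["input_ids"][0])
for i, t in enumerate(tokens):
    top = attn[i].argmax().item()
    print(f"{t:>12s} attends most to {tokens[top]}")
```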

4.4.2 Sample-based

Sample-based techniques investigate output changes for different inputs. In contrast to perturbation-based methods, which selectively change individual features to investigate their impact, sample-based techniques consider the sample as a whole: they relate various inputs to their corresponding outputs rather than attributing an output to specific features within a single input.

“Training data influence” measures the impact of a specific training sample on the model, typically on the output for a particular input. Grosse et al. (2023) addressed computational issues to employ influence functions for LLMs. Explainability has been transferred from a large natural language inference dataset to other tasks (Yordanov et al. 2021).
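
A heavily simplified proxy for training-data influence is sketched below: training examples are scored by the alignment of their loss gradients with the gradient of a test example. Full influence functions, as scaled to LLMs by Grosse et al. (2023), additionally involve inverse-Hessian-vector products, which are omitted here; `model` and `loss_fn` are assumed to be a torch model returning logits and a standard loss.

```python
# Simplified gradient-similarity proxy for training-data influence.
# `model` is any torch model whose forward returns logits; `loss_fn` is, e.g.,
# cross-entropy. This omits the inverse-Hessian term of true influence functions.
import torch

def grad_vector(model, loss_fn, x, y):
    model.zero_grad()
    loss_fn(model(x), y).backward()
    return torch.cat([p.grad.flatten() for p in model.parameters() if p.grad is not None])

def influence_proxy(model, loss_fn, train_examples, test_example):
    """Score each training example by how well its loss gradient aligns with the
    test example's loss gradient (higher = heuristically more influential)."""
    x_test, y_test = test_example
    g_test = grad_vector(model, loss_fn, x_test, y_test)
    return [torch.dot(grad_vector(model, loss_fn, x, y), g_test).item()
            for x, y in train_examples]
```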

Adversarial samples are inputs altered by small changes that are hard for humans to perceive yet lead to a change in outputs. They are typically discussed in the context of cybersecurity, where an attacker aims to alter model outputs without a human noticing the input change. However, schemes that aim to "trick" humans and classifiers alike have also been proposed (Schneider and Apruzzese 2023). For example, SemAttack (Wang et al. 2022a) perturbs embeddings of BERT tokens, while other attacks exchange words (Jin et al. 2020). Parts of inputs can also be occluded to better understand model behavior (Schneider and Vlachos 2023a).

Counterfactual explanations seek minimal changes to an input so that the output changes from a class y to a specific class \(y'\). In contrast to adversarial samples, the changes may be noticeable to humans. For example, GPT-2 has been fine-tuned to provide counterfactuals based on pairs of original and perturbed input sentences (Wu et al. 2021). Exploring LLM capabilities through counterfactual task variations, Wu et al. (2023) have shown that LLMs commonly rely on narrow, context-specific procedures that do not transfer well across tasks. Augustin et al. (2022) and Jeanneret et al. (2022) use diffusion models guided by classifiers to create counterfactual explanations.
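
A toy greedy search for counterfactuals over a black-box text classifier might look as follows; the `predict_proba` callable and the candidate-substitution lexicon are assumptions supplied by the caller rather than parts of any cited method.

```python
# Toy greedy counterfactual search over a black-box text classifier.
# `predict_proba(text)` returns class probabilities; `candidates` maps a word
# to possible substitutes (e.g., from a lexicon). Both are caller-supplied.
def counterfactual(text, predict_proba, target, candidates, max_edits=2):
    """Find a small word substitution that makes `target` the predicted class."""
    words = text.split()
    for _ in range(max_edits):
        if predict_proba(" ".join(words)).argmax() == target:
            return " ".join(words)  # minimal change found
        # Greedily pick the substitution that most increases the target class.
        scored = [
            (predict_proba(" ".join(words[:i] + [s] + words[i + 1:]))[target], i, s)
            for i, w in enumerate(words)
            for s in candidates.get(w, [])
        ]
        if not scored:
            return None
        _, i, s = max(scored)
        words[i] = s
    return " ".join(words) if predict_proba(" ".join(words)).argmax() == target else None
```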

Contrastive explanations explain why a model predicted y rather than \(y'\). Contrastive explanations are said to better disentangle different aspects (such as part of speech, tense, semantics) by analyzing why a model outputs one token instead of another (Yin and Neubig 2022).
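
The contrastive idea can be sketched by attributing the difference between the logits of the produced token and a foil token, for example via gradient-times-input on the embeddings; the model name and token pair are illustrative, and this is a simplified variant of the cited approach.

```python
# Contrastive sketch: why " she" rather than " he"? Attribute the difference in
# logits between the produced token and a foil token via gradient-times-input
# (model name and token pair illustrative).
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

name = "gpt2"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).eval()

enc = tok("The doctor said that", return_tensors="pt")
embeds = model.get_input_embeddings()(enc["input_ids"]).detach().requires_grad_(True)
logits = model(inputs_embeds=embeds, attention_mask=enc["attention_mask"]).logits[0, -1]

target, foil = tok.encode(" she")[0], tok.encode(" he")[0]
(logits[target] - logits[foil]).backward()   # contrastive objective

relevance = (embeds.grad * embeds).sum(-1).squeeze(0)
for token, score in zip(tok.convert_ids_to_tokens(enc["input_ids"][0]), relevance):
    print(f"{token:>10s} {score.item():+.3f}")
```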

4.4.3 Probing-based

Probing-based methods aim at understanding what knowledge an LLM has captured through "queries" (probes). A classifier (probe) is commonly trained on a model’s activations to distinguish different types of inputs and outputs.

Knowledge-based For example, encoders such as BERT, MiniXX, and T5 that produce vectors can be probed by training a classifier on their outputs to identify the presence of properties or abilities that emerge from the inputs, such as syntactic knowledge (Chen et al. 2021) and semantic knowledge (Tenney et al. 2019). As an alternative to training classifiers, datasets focusing on specific aspects such as grammar can be created (Marvin and Linzen 2018); the model's performance on such a dataset indicates its ability to capture the property. The design of these datasets requires care, as regularities might provide an opportunity for shortcut learning (Zhong et al. 2021), which foregoes learning the properties in favor of exploiting dataset-specific regularities. Hernandez et al. (2023) learn how to map statements in natural language to fact encodings (in an LLM's representation). In turn, this allows a new way to detect (and explain) when LLMs fail to integrate information from context; the work argues that untruthful texts result from not integrating textual information into specific internal representations. Probing has also been used to study training dynamics: Liu et al. (2021) investigated the training of RoBERTa over time and found that local information, such as parts of speech, is acquired before long-distance dependencies such as topics. Goyal et al. (2022) analyze how a large language model learns text summarization by obtaining summaries and output probabilities for a fixed set of articles at different points during training. They track the n-gram overlap between generated summaries and the original articles over time, concluding, for example, that models learn to copy early during training (high overlap) and do so less as training progresses (decreasing overlap).
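
A minimal linear-probe sketch is given below: frozen encoder representations are fed to a simple classifier to test whether a toy property is linearly decodable. Model name, example sentences, and labels are illustrative; a real probing study would use held-out data and carefully constructed datasets.

```python
# Linear-probe sketch: train a simple classifier on frozen encoder representations
# to test whether a property is linearly decodable (model and data illustrative).
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoTokenizer, AutoModel

name = "bert-base-uncased"
tok = AutoTokenizer.from_pretrained(name)
encoder = AutoModel.from_pretrained(name).eval()

texts = ["the cats sleep", "the cat sleeps", "the dogs bark", "the dog barks"]
labels = [1, 0, 1, 0]  # toy property: plural (1) vs. singular (0) subject

def embed(text: str) -> torch.Tensor:
    with torch.no_grad():
        out = encoder(**tok(text, return_tensors="pt"))
    return out.last_hidden_state[0, 0]  # [CLS] representation

X = torch.stack([embed(t) for t in texts]).numpy()
probe = LogisticRegression(max_iter=1000).fit(X, labels)
print("probe accuracy:", probe.score(X, labels))  # in practice: held-out data
```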

Concept-based explanation Typically, given a set of concepts, concept-based explanations provide relevance scores for these concepts within inputs (Kim et al. 2018). More recently, it has also been proposed to uncover concepts based on what input information is still present at specific layers (or embeddings) (Schneider and Vlachos 2023a). While the latter investigates images, high-impact concepts as a source of explanation have also been used for LLMs (Zhao et al. 2023b). Foote et al. (2023) interpret a large set of individual neurons by constructing a visualizable graph using the training data together with truncation and saliency methods. However, concept-based methods must be designed carefully, as merely investigating interactions among input variables might be insufficient to show that symbolic concepts are learned (Li and Zhang 2023).

Neuron activation explanation Individual neurons can also be understood via their activations on inputs. Recently, GPT-4 has been used to generate textual explanations for individual neuron activations of GPT-2 (OpenAI 2023b); for example, GPT-4 summarizes the text that triggers large activations for a neuron. Dai et al. (2022) uncovered "knowledge neurons" that store particular facts. They performed knowledge attribution by gradually increasing a neuron's weights from 0 to their original value while summing up the gradient (an integrated-gradients scheme); if the neuron is relevant for a particular fact, the sum should be large.
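
The following sketch illustrates the underlying idea of neuron-level inspection: a forward hook records the value of one intermediate MLP unit, and the token that activates it most strongly in each input is reported. The layer and unit indices are arbitrary; summarizing such top-activating contexts in natural language is what the GPT-4-based approach automates.

```python
# Sketch: characterize one intermediate MLP unit by the tokens that activate it
# most strongly (model name and unit indices illustrative).
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

name = "gpt2"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).eval()

layer, neuron = 5, 123  # arbitrary choice of hidden unit to inspect
acts = {}
# Record the value of one intermediate MLP unit (output of the first MLP projection).
model.transformer.h[layer].mlp.c_fc.register_forward_hook(
    lambda module, inputs, output: acts.update(values=output[..., neuron].detach())
)

texts = ["Paris is the capital of France.", "The stock market fell sharply today."]
for text in texts:
    enc = tok(text, return_tensors="pt")
    with torch.no_grad():
        model(**enc)
    tokens = tok.convert_ids_to_tokens(enc["input_ids"][0])
    top = acts["values"][0].argmax().item()
    print(f"unit {neuron} in layer {layer} fires most on {tokens[top]!r} in: {text!r}")
```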

4.4.4 Mechanistic interpretability

Mechanistic interpretability investigates neurons and their interconnections, aiming to reverse-engineer model components into human-understandable algorithms (Olah 2022). Models can be viewed as graphs (Geiger et al. 2021), and circuits (i.e., subgraphs) can be identified that yield certain functionality (Wang et al. 2022b). Common approaches fall into three categories (Luo and Specia 2024): circuit discovery, causal tracing, and vocabulary lens. The typical workflow to discover a circuit is often manual and involves (Conmy et al. 2024): (i) observing a behavior (or task) of a model, creating a dataset to reproduce it, and choosing a metric to measure the extent to which the model performs the task; (ii) defining the scope of interpretation (for example, the layers of the model); and (iii) performing experiments to prune connections and components from the model. Circuit-based analysis can also focus on specific architectural elements. For instance, feedforward layers have been assessed and associated with human-understandable concepts (Geva et al. 2022). Additionally, two-layer attention-only networks have been investigated, leading to conjectures about how in-context learning might work (Olsson et al. 2022). Recent work has automated the process of finding connections between abstract neural network units that constitute a circuit (Conmy et al. 2024). Modern causal tracing commonly estimates the impact of intermediate activations on the output (Meng et al. 2022). While causal tracing moves from activations to outputs, the vocabulary lens focuses on establishing relations to the vocabulary space. For example, Geva et al. (2022) project weights and hidden states to the vocabulary space. Individual tokens have also been assessed (Ram et al. 2022; Katz and Belinkov 2023): Katz and Belinkov (2023) create information-flow graphs in which processed vectors within attention heads and memory values are mapped to human-readable tokens.
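
A simple ablation experiment in this spirit can be sketched as follows: each attention head is knocked out in turn (via a head mask) and the change in a task metric, here the log-probability of a correct completion, is recorded; heads whose removal causes large drops are candidate circuit members. Model, prompt, and threshold are illustrative, and this is a far cry from full circuit discovery.

```python
# Sketch of a circuit-style ablation experiment: knock out one attention head at
# a time and measure the drop in the log-probability of a correct completion
# (model, prompt, and threshold illustrative).
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

name = "gpt2"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).eval()

prompt, answer = "The Eiffel Tower is located in", " Paris"
enc = tok(prompt, return_tensors="pt")
answer_id = tok.encode(answer)[0]

def answer_logprob(head_mask=None):
    with torch.no_grad():
        logits = model(**enc, head_mask=head_mask).logits[0, -1]
    return torch.log_softmax(logits, dim=-1)[answer_id].item()

base = answer_logprob()
n_layers, n_heads = model.config.n_layer, model.config.n_head
for layer in range(n_layers):
    for head in range(n_heads):
        mask = torch.ones(n_layers, n_heads)
        mask[layer, head] = 0.0            # ablate a single head
        drop = base - answer_logprob(mask)
        if drop > 0.5:                      # arbitrary reporting threshold
            print(f"layer {layer} head {head}: log-prob drop {drop:.2f}")
```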

4.4.5 Structuring based on novel dimensions

We also classify existing techniques based on the uncovered characteristics, as shown in Table 6. Regarding scope, no existing technique explicitly focuses on the entire interaction; however, LLMs providing self-explanations could be used for that purpose. When it comes to explaining one, multiple, or all properties of the output, a feature attribution map typically explains at most one attribute of the output. For example, by highlighting positive and negative words or phrases, sentiment might be explained. Sample-based techniques can potentially explain multiple attributes: if the chosen samples differ in certain characteristics, such as images showing the same object but in different poses and colors, it can be concluded that the latter two attributes are irrelevant. Mechanistic interpretability and probing often aim to isolate a single concept, but this is not a requirement. Existing XAI techniques are typically non-dynamic, whereas explanations from LLMs can be interactive, personalized, and often sample-dependent. Table 7 shows a concept matrix linking foundational sources to XAI methods. Feature attribution techniques are typically posthoc methods that leverage the model to generate explanations. However, LLMs could also be prompted to generate feature attribution explanations. Sample-based techniques commonly rely on the training data and the model, for example, to determine which samples most strongly activate a particular neuron. While mechanistic approaches commonly rely on a dataset, it does not have to stem from the training data. Probing can be performed naturally through prompts and other forms of input, followed by analyzing the corresponding outputs.

Table 7 Mapping of XAI categories and selected techniques to foundational sources for XAI

Most existing techniques require white-box access (Table 8). It might be possible to gain limited insights using black-box access only, for example, by using occlusion (or masking) for feature attribution techniques: one can investigate whether predictions change, but such an approach typically yields coarser and less accurate explanations than having access to the output probabilities.

Table 8 Mapping of XAI categories and selected techniques to required access by XAI techniques

5 Research Agenda of XAI for GenAI, Discussion and Conclusions

As the field of XAI for GenAI is quickly emerging, there are many opportunities for research covering more technical algorithmic avenues as well as economic and psychological aspects. Bridging the gap between AI research and other disciplines, such as cognitive science, psychology, and the humanities, is likely the way forward for many topics. For instance, evaluating explanations often requires expertise beyond AI, such as domain expertise demonstrated in studies involving product engineers (Johny et al. 2024) or psychologists who provide theoretical frameworks and insights on human behavior, e.g., when assessing the suitability of LLM outputs for children (Schneider et al. 2024b). Furthermore, policymakers and AI experts need to collaborate to develop regulations that are both implementable and effective in limiting risks due to legal uncertainties for companies (Walke et al. 2023; Schneider et al. 2024a). More generally, possible directions include:

  1. Explaining interactions rather than single input–outputs is particularly interesting for interdisciplinary research conducted in fields such as information systems and human-computer interaction, as it requires an understanding of humans, models, and systems (see "Interaction Scope" in Sect. 4.2.1).

  2. Real-time and interactive explanations based on user queries and feedback could be further explored, necessitating insights from multiple disciplines (Sect. 4.2.3).

  3. Multimodal GenXAI: Currently, most explanations are of a single modality, commonly either text or visual. There is a lack of techniques providing explanations using more than one modality (Sect. 4.2.2).

  4. Adding XAI to novel directions of GenAI, such as GenXAI for video, 3D content generation, and actions (see Table 2), is urgently needed.

  5. Deriving novel XAI techniques, particularly in the field of mechanistic interpretability (Sect. 4.4.4), which investigates the inner workings of GenAI models, is a promising yet challenging frontier.

  6. Addressing verifiability and hallucinations using AI: Hallucinations are one of the biggest challenges of GenAI. GenXAI can help mitigate them using techniques such as chain-of-thought prompting, which explains the reasoning process, but more techniques are needed.

  7. Porting pre-GenAI XAI techniques to large models by addressing their computational concerns, as was done for SHAP (see Sect. 4.4.1).

  8. Personalization of explanations: Recognizing the diversity in users’ backgrounds, expertise, and needs, future research should focus on personalizing explanations as an important desideratum (see Sect. 3.3).

  9. Explanation difficulty has received little attention. A more thorough understanding in the context of GenAI, quantifying the phenomenon and potentially leading to techniques that account for such difficulty, is another future avenue (see Sect. 4.3.4).

  10. Dealing with the complex nature of GenAI outputs (compared to simple classification in the pre-GenAI context) remains under-explored. Interactive, user-driven investigation is one avenue (Sect. 4.2.3); isolating particular facets using mechanistic interpretability techniques focusing on circuits (Sect. 4.4.4) is another. In general, little is known as of today. For example, it is unclear which facets should be explained for text or images (see "Output Scope" in Sect. 4.2.1) and how to relate actual explanations to high-level objectives such as maximizing plausibility, e.g., "What attributes should be explained so that an explanation appears plausible?" (Sect. 4.2.3).

  11. Building GenXAI to target ethical, societal, and regulatory concerns, including contributing to the mitigation of biases and enhancing fairness by identifying and quantifying them through XAI techniques (see the second-to-last paragraph in this section).

  12. GenXAI itself raises a number of critical ethical issues. For example, in legal cases, an explanation of why someone performed an action can decide over life and death. As such, explanations themselves need to be carefully assessed with respect to alignment criteria (Sect. 3.1).

As with any work, this paper adopts a specific perspective, emphasizing certain aspects while foregoing others. Specifically, we did not aim to be comprehensive in all regards. Given the vast array of works on GenAI and XAI, we decided to omit detailed aspects of XAI related to the evaluation and usage of explanations. Additionally, we did not reiterate all existing XAI techniques prior to GenAI; instead, we focused on novel aspects and methods for GenAI, referring to other surveys for this purpose. Regarding modalities, we concentrated on images and text, as these are currently the two most prevalent. However, other modalities like video and 3D content generation are quickly emerging, and audio-to-text (and vice versa) has been established for some time. We conceptualized existing techniques and discussed technical aspects, but a more mathematical treatment could be provided. As AI evolves rapidly, our work is only a snapshot in time. Our conceptualization is likely to evolve further, though we believe the uncovered dimensions will endure, albeit with enhancements and modifications. Additionally, GenAI (in combination with XAI) involves many ethical and societal implications, which we partially addressed in Sects. 3.2 and 3.1. However, we refrained from a detailed exploration. We chose to adopt a neutral stance, pointing out concerns rather than prescribing actions to address ethical issues. For example, the lack of access to commercial models hinders transparency but protects company know-how. As such, one might advocate for regulations to increase transparency or allow companies to keep model details secret to protect intellectual property. From an organizational perspective, greater transparency might also enhance user trust. Therefore, the preferred governance of GenAI transparency by companies or society is debatable (Schneider et al. 2024a). We do not take a particular stance in these debates.

In conclusion, GenXAI is a crucial area for AI. Given the rapid advancement of GenAI and its widespread implications for individuals, society, and the economy, more work should be dedicated to this area, especially considering the many research gaps highlighted in our roadmap, such as interactivity and the scope of explanations. Our work contributes in this direction by thoroughly motivating the study of XAI for GenAI, structuring existing knowledge through an enhanced conceptualization of XAI, uncovering novel dimensions, and setting forth a research agenda that calls for a joint effort by the research community to address open issues, which will hopefully contribute to the well-being of all.