1 Introduction

Generative AI (GenAI) has shown remarkable capabilities that have shaken up the world on a broad basis, ranging from regulators (European Union 2023) and educators (Baidoo-Anu and Ansah 2023) to programmers (Sobania et al. 2023) and medical staff (Thirunavukarasu et al. 2023). For businesses (Porter 2023), GenAI has the potential to unlock trillions of dollars annually (McKinsey & Company 2023). At the same time, it is said to threaten mankind (The Guardian 2023). These opposing views are a key driver for understanding and explaining GenAI. Generative AI, driven by foundation models, represents the next level of AI, capable of creating text, images, audio, 3D solutions, and videos (Schneider et al. 2024c; Gozalo-Brizuela and Garrido-Merchan 2023; Cao et al. 2023), all controllable by humans through textual prompts (White et al. 2023) (see Table 2 for examples of public GenAI systems). This advancement marks a significant shift from AI that primarily "recognizes" to AI that "generates." GenAI has shown unprecedented capabilities, such as passing university-level exams (Choi et al. 2021; Katz et al. 2024). It also achieves remarkable results in areas once considered unsuitable for machines, such as creativity (Chen et al. 2023a). It is accessible to everyone, as witnessed by commercial systems like ChatGPT (Achiam et al. 2023) and Dall-E (Betker et al. 2023; Ramesh et al. 2022). Early generative AI methods, like Generative Adversarial Networks (GANs), could also generate artifacts but were typically more difficult to control than modern models such as transformers and diffusion architectures.

Explainable AI for GenAI (GenXAI) techniques provide explanations that help understand AI outputs for individual inputs or the model as a whole. Traditionally, explanations have served various purposes, such as increasing trust and aiding in model debugging (Meske et al. 2022). The need for understanding AI is greater now than in pre-GenAI eras. For example, explanations can support the verifiability of generated content, helping combat one of GenAI’s major problems: hallucinations (as discussed in Sect. 3.1). Unfortunately, Explainable AI (even for pre-GenAI models) still faces several open problems despite numerous attempts to address them over the past few years (Longo et al. 2024; Meske et al. 2022). For example, a recent comparison (Silva et al. 2023) of methods on the impact of XAI on human-agent interaction found only a 20% difference in scores between the best (counterfactuals) and worst method (using probability scores), suggesting that complex methods offer limited benefits over simpler ones. Therefore, XAI techniques are still far from optimal. Other works have even described the “status quo in interpretability research as largely unproductive” (Räuker et al. 2023). Therefore, much work remains, and it is essential to understand current efforts to learn from and improve upon them, especially to mitigate high risks (The Guardian 2023) while leveraging opportunities (Schneider et al. 2024c).

Table 1 Nomenclature

This research manuscript aims to make genuine progress in this direction. Our goal is not merely to list and structure existing XAI techniques; at this stage, more fundamental questions need addressing, such as identifying key challenges and desiderata for GenXAI. To this end, we opted for a narrative review methodology (King and He 2005) combined with a taxonomy development approach from the field of information systems (Nickerson et al. 2013). Several surveys on XAI focus on the pre-GenAI era with a primary technical focus (Adadi and Berrada 2018; Zini and Awad 2022; Dwivedi et al. 2023; Schwalbe and Finzel 2023; Räuker et al. 2023; Saeed and Omlin 2023; Speith 2022; Minh et al. 2022; Bodria et al. 2023; Theissler et al. 2022; Guidotti et al. 2019; Guidotti 2022) and an interdisciplinary or social science focus (Miller 2019; Meske et al. 2022; Longo et al. 2024). Building upon these, we conduct a meta-survey to structure our methods, leveraging knowledge from the pre-GenAI era. Additionally, we uncover novel aspects related to GenAI that have not yet been covered. Many works have surveyed various aspects of GenAI (excluding XAI) (Xu et al. 2023; Lin et al. 2022; Xing et al. 2023; Yang et al. 2023b; Zhang et al. 2023a, c; Pan et al. 2023). We use such surveys for our technical background. Some sub-areas of GenAI, such as knowledge identification and editing (Zhang et al. 2024), use isolated XAI techniques as tools but do not aim to elaborate on them generally. While we could not find any review discussing XAI for GenAI, some research manuscripts take a holistic, partially opinionated view on XAI for large language models (LLMs) (Singh et al. 2024; Liao and Vaughan 2023) or explicitly survey XAI for LLMs (Zhao et al. 2023a; Luo and Specia 2024). None of the prior works provide a comprehensive list of desiderata, motivations, challenges for XAI for GenAI, and a taxonomy. Many of our novel aspects, in particular, cannot be found in prior works. Furthermore, even when focusing solely on LLMs, we differ considerably from prior works.

Fig. 1 Article outline. From important factors and challenges of XAI for GenAI, desiderata and, in turn, a taxonomy are derived. All three together inform our research agenda

We start by presenting a technical background. To derive our contributions, we follow the outline in Fig. 1. We then provide motivation and challenges for XAI for GenAI, highlighting novel aspects that emerge with GenAI, such as its increased societal reach and the need for users to interactively adjust complex, difficult-to-evaluate outputs. From this, we derive desiderata, i.e., requirements that explanations should ideally meet, such as supporting interactivity and output verification. Next, we develop a taxonomy for existing and future XAI techniques for GenAI. To categorize XAI, we use dimensions related to the inputs, outputs, and internal properties of GenXAI techniques that distinguish them from pre-GenAI, such as self-explanation and different sources and drivers for XAI, like prompts and training data. Using the identified challenges and desiderata, the remainder of this manuscript focuses on discussing novel dimensions for GenXAI and the resulting taxonomy, as well as XAI methods in conjunction with GenAI. Finally, we provide future directions. Our key contributions include describing the need for XAI for GenAI, outlining desiderata for explanations, and developing a taxonomy for mechanisms and algorithms that includes novel dimensions for categorization (Table 2).

Table 2 Examples of in-/outputs for GenAI. For more examples, see (Gozalo-Brizuela and Garrido-Merchan 2023)

2 Technical background

Here, we provide a short technical introduction to generative AI, covering key ideas on system and model architectures and training procedures. We restrict ourselves to text and image data to illustrate multi-modality. For video and audio, please refer to other surveys (e.g., Selva et al. 2023; Zhang et al. 2023b). Nomenclature is provided in Table 1.

2.1 System architectures

GenAI models can function as stand-alone applications with a simple user interface, allowing textual inputs or uploads and displaying responses, as seen with OpenAI’s ChatGPT (Achiam et al. 2023). Thus, a system might essentially consist of one large model; such models are almost exclusively based on deep learning, taking an input, processing it with a neural network, and yielding an output. For multi-modal applications, systems that consist of an LLM and other generative models, such as diffusion models, are typically employed. However, GenAI-powered systems may involve external data sources and applications interacting in complex patterns, as illustrated in Fig. 2. An orchestration application may determine actions based on GenAI outputs or user inputs. For example, in ChatGPT-4, a user can include a phrase like “search the internet” in the prompt, implying that an Internet search is conducted first, and the retrieved web content is then fed into the GenAI model. The orchestration application is responsible for performing the web search and modifying the prompt to the GenAI model, e.g., enhancing it with an instruction like “Answer based on the following content:” followed by the retrieved web information.
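To make the orchestration pattern concrete, the following minimal sketch shows a hypothetical orchestration loop; the function names (web_search, call_genai_model) and the trigger phrase are illustrative placeholders rather than the API of any actual product.

```python
def web_search(query: str) -> str:
    """Placeholder for an external search application (illustrative stub)."""
    return "...retrieved web content..."

def call_genai_model(prompt: str) -> str:
    """Placeholder for the call to the GenAI model (illustrative stub)."""
    return f"[model response to: {prompt[:60]}...]"

def orchestrate(user_prompt: str) -> str:
    """Decide on actions based on the user input: optionally search the web first,
    then augment the prompt with the retrieved content before calling the model."""
    if "search the internet" in user_prompt.lower():
        retrieved = web_search(user_prompt)
        user_prompt = ("Answer based on the following content:\n"
                       f"{retrieved}\n\n{user_prompt}")
    return call_genai_model(user_prompt)

print(orchestrate("Search the internet: what is GenXAI?"))
```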

Fig. 2 Overview of GenAI system architectures comprising a single or combined model (with a GUI) as well as GenAI systems interacting with other applications

2.2 GenAI model architectures

We discuss key aspects of the transformer architecture and diffusion models and briefly elaborate on other generative models.

2.2.1 Transformers

Transformers are the de-facto standard for LLMs, while GenAI models involving images may also use diffusion models, variational autoencoders (VAEs), and generative adversarial networks (GANs). The transformer model makes few assumptions (priors) about the input data, making it highly flexible. Since such priors reduce the amount of data needed to train a model, transformers often require more data than other models to achieve the same performance, although simpler models with stronger priors might never reach the same top-level performance. The transformer architecture (Vaswani et al. 2017) (Fig. 3) has many variations (Lin et al. 2022), mostly involving different implementations of individual elements, such as different types of positional embeddings (Dufter et al. 2022), different types of attention (de Santana Correia and Colombini 2022), or even replacing some components. For instance, Hyena (Poli et al. 2023) provides a drop-in replacement for attention based on convolutions. The goals of these adjustments are typically better performance and faster computation. For example, the original transformer requires runtime quadratic in the input length, making it prohibitive for very long inputs. The vanilla transformer architecture (Fig. 3) consists of an encoder and a decoder, where the decoder processes the outputs of the encoder and the (shifted) targets. Consider a translation scenario, where the encoder takes a sentence in the source language, and the decoder generates one output word at a time in the desired language. Each generated word also becomes an input to the decoder for generating the next word. Decoder-only architectures, such as the GPT series (Radford et al. 2019; Achiam et al. 2023), lack an encoder. In contrast, encoder-only architectures lack a decoder and typically produce contextualized embeddings of single words or text fragments (e.g., BERT (Devlin et al. 2019)).

Fig. 3 Transformer architecture (Vaswani et al. 2017)

Both the encoder and the decoder take tokens as inputs, which are mapped to embeddings, i.e., vectors in a latent space. Raw tokens can only be compared for equality, whereas vectors in a latent space allow a more nuanced similarity computation and, potentially, the extraction of specific token attributes such as sentiment. Thus, many current GenAI systems use encoders to obtain vector embeddings and retrieve relevant information to enhance prompts by searching in a vector database (Lewis et al. 2020). A positional encoding is added to the text embedding so the network knows the position of each word. After that, inputs are processed through multi-head attention. A single attention head focuses on a specific aspect of the input (e.g., syntax or sentiment), and the result is then further processed through a Feed Forward Network (FFN). Attention can be masked for the decoder so it cannot access the actual or future targets it wants to predict. For example, in the common task of next-word prediction, a triangular matrix mask prevents the network from accessing the next word and the words following it.
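As a minimal illustration of the masking described above, the following sketch computes single-head scaled dot-product attention with a causal (triangular) mask in NumPy; shapes and variable names are illustrative, and the multi-head projection matrices are omitted for brevity.

```python
import numpy as np

def causal_attention(Q, K, V):
    """Single-head scaled dot-product attention with a causal mask.

    Q, K, V have shape (seq_len, d); position t may only attend to
    positions <= t, as required for next-word prediction.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                  # (seq_len, seq_len)
    mask = np.triu(np.ones_like(scores), k=1)      # 1s mark future positions
    scores = np.where(mask == 1, -np.inf, scores)  # block attention to the future
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True) # row-wise softmax
    return weights @ V

x = np.random.randn(5, 8)               # 5 tokens with embedding size 8
print(causal_attention(x, x, x).shape)  # (5, 8)
```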

2.2.2 Diffusion models

Diffusion models learn to reconstruct noisy data. They first distort inputs by repeatedly adding small amounts of noise until the image is indistinguishable from noise following, e.g., a Gaussian distribution that one can sample from. For sample generation, they reverse this process, taking “noise” as input and reconstructing samples. There are several mathematically intricate methods for diffusion models (Yang et al. 2023b). Here, we discuss the key steps of a prominent technique highly relevant for text-to-image generation: the Denoising Diffusion Probabilistic Model (DDPM) (Ho et al. 2020). In the forward pass (input distortion), DDPM acts as a Markov chain, meaning only the current state (or input) is relevant for the next output. For a given data distribution \(x_0 \sim q(x_0)\), DDPM produces output \(x_T\) in a sequence of T sequential steps by computing at step t: \(q(x_t|x_{t-1})=N(x_t;\sqrt{1-\beta _t}x_{t-1},\beta _t I)\). Thus, the overall forward process is \(q(x_{1:T}|x_0)=\prod _{t=1}^T q(x_t|x_{t-1})\). Here, N represents the Gaussian distribution, and \(\beta _t\) is a hyperparameter. The reverse pass for generation starts from \(p_{\theta }(x_T)\) and produces \(p_{\theta }(x_0)\), which should follow the true data distribution \(q(x_0)\). Since generation starts from random noise, the outputs of diffusion models are not easily controllable; additional input must be provided during the reconstruction process to guide the generation towards user-desired images, as discussed in Sect. 2.4.
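For illustration, the following sketch implements the forward noising step \(q(x_t|x_{t-1})\) defined above on a toy one-dimensional "image" in NumPy; the linear noise schedule and the toy data are illustrative assumptions.

```python
import numpy as np

def forward_step(x_prev, beta_t, rng):
    """One DDPM forward step: sample x_t ~ N(sqrt(1 - beta_t) * x_{t-1}, beta_t * I)."""
    noise = rng.standard_normal(x_prev.shape)
    return np.sqrt(1.0 - beta_t) * x_prev + np.sqrt(beta_t) * noise

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)   # illustrative noise schedule (hyperparameters beta_t)
x = rng.standard_normal(16)          # toy "image" with 16 pixels

for beta in betas:                   # after T steps, x is close to pure Gaussian noise
    x = forward_step(x, beta, rng)
```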

2.3 Other generative models: VAEs, GANs

We also discuss generative models aside from diffusion models but refer to (Yang et al. 2023b) for a detailed comparison and in-depth elaboration. Generative Adversarial Networks (GANs) (Goodfellow et al. 2015) are trained using a generator that constructs an output from a random vector and a discriminator that aims to distinguish generated outputs \({\hat{x}}\) from actual samples of the true data distribution \(x \sim q(x)\). An autoencoder (Li et al. 2023b) comprises an encoder and a decoder. The encoder compresses a given input into a latent space, and the decoder attempts to reconstruct the input from this compressed representation. A Variational Autoencoder (VAE) constrains the latent space to a given prior distribution through a regularization term as part of the optimization objective. If the latent space follows a known distribution, sampling (i.e., sample generation) is facilitated.
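To make the VAE regularization concrete, the standard training objective maximizes the evidence lower bound \(\mathbb {E}_{q_\phi (z|x)}\left[ \log p_\theta (x|z)\right] - D_{KL}\left( q_\phi (z|x)\,\Vert \,p(z)\right) \), where the first term rewards faithful reconstruction by the decoder \(p_\theta \) and the second (regularization) term pulls the encoder distribution \(q_\phi (z|x)\) towards the prior \(p(z)\), e.g., a standard Gaussian, so that sampling \(z \sim p(z)\) and decoding yields realistic outputs.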

2.4 Controlling outputs

Diffusion models and other generative models like VAEs and GANs produce high-quality outputs from random inputs. Controlling these outputs is not straightforward and requires additional effort. Early techniques for controllable image synthesis generated outputs conditioned on class labels or disentangled dimensions of the latent space so that altering a dimension corresponds to a human-interpretable operation (Chen et al. 2016). However, these approaches offer very limited ways to customize outputs. Text-to-image models offer more versatility. They typically encode text and use it as input for generation. Text encoding may be done using a frozen encoder (Saharia et al. 2022) or by modifying the text encoder during the training process (GLIDE (Nichol et al. 2022)). Text-to-image models require (image, text) pairs for training. The textual encoder can be derived from the text of these pairs only or from a broader corpus and larger models. Text-to-image models can either start generation from a low-dimensional latent space, such as Dall-E (Ramesh et al. 2022; Betker et al. 2023), or construct images directly in pixel space, such as GLIDE (Nichol et al. 2022). To illustrate, we briefly discuss Dall-E (Ramesh et al. 2022; Betker et al. 2023). Dall-E uses a multimodal contrastive model (CLIP) in which image and text embeddings are matched. It also includes a text-to-image generator that bridges the gap between the CLIP text and image latent spaces, which can be learned using a diffusion prior.

2.5 LLM training

LLM training comprises at least one and up to three phases. These phases differ in training methodology, goals, and data requirements.

Self-supervised pre-training The raw model is trained on vast amounts of data using self-supervised techniques, such as next word prediction (GPT-2) or tasks like predicting masked words and sentence order (BERT). The goal is to learn a flexible and broad representation of text that can serve as a foundation for many different tasks. The data can include various types of text, as described for LLama-2 (Touvron et al. 2023), such as code, Wikipedia articles, or parts of the internet (Common Crawl Foundation 2024).

Instruction tuning This phase adapts a pre-trained LLM through supervised fine-tuning (Lou et al. 2023; Zhang et al. 2023c). The term "instruction" is often used interchangeably with prompt, but instructions tend to be more explicit and directive with precise guidance. The goal is to improve performance on common use cases for LLMs, such as following short instructions. The training data can include task instructions, input (task instance), and desired outputs. Instruction tuning enhances performance on both seen and unseen tasks and across different scales of architectures (Longpre et al. 2023). It can also encode additional domain-specific knowledge, such as medical information (Singhal et al. 2023).
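For illustration, an instruction-tuning example is often stored as a triple of instruction, optional input (task instance), and desired output; the field names below are illustrative rather than those of any specific dataset.

```python
instruction_example = {
    "instruction": "Summarize the following review in one sentence.",
    "input": "The battery lasts two days, but the screen scratches easily ...",
    "output": "Long battery life, but the screen is prone to scratches.",
}

# Supervised fine-tuning trains the LLM to produce `output` given the
# concatenation of `instruction` and `input`.
prompt = instruction_example["instruction"] + "\n" + instruction_example["input"]
target = instruction_example["output"]
```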

Alignment tuning LLMs might not incorporate human values and preferences, producing harmful, misleading, and biased outputs (Shen et al. 2023; Weidinger et al. 2022). Alignment criteria can be diverse, covering helpfulness, honesty, and harmlessness (Ouyang et al. 2022). Training data typically comes from humans, who rank LLM-generated answers or produce their own answers. Occasionally, more powerful and already aligned LLMs produce training data (Taori et al. 2023). Training can involve fine-tuning the LLM using supervised learning with the alignment dataset. Alternatively, human feedback can be used to learn a reward model that predicts human scoring of an output. This reward model can adjust the LLM using reinforcement learning, where the LLM generates outputs and receives feedback from the reward model.
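A common way to learn the reward model from human rankings is a pairwise preference loss that scores the preferred answer above the rejected one. The sketch below (PyTorch, with a small placeholder reward network and random embeddings standing in for encoded prompt-answer pairs) illustrates this idea, not any specific implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Placeholder reward model mapping a (prompt, answer) embedding to a scalar reward.
reward_model = nn.Sequential(nn.Linear(768, 128), nn.ReLU(), nn.Linear(128, 1))

def preference_loss(emb_chosen, emb_rejected):
    """Pairwise loss: the human-preferred answer should receive the higher reward."""
    r_chosen = reward_model(emb_chosen)
    r_rejected = reward_model(emb_rejected)
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Random embeddings stand in for a batch of 4 ranked answer pairs.
loss = preference_loss(torch.randn(4, 768), torch.randn(4, 768))
loss.backward()  # the trained reward model then provides feedback for RL fine-tuning
```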

3 The importance, challenges and desiderata of GenXAI

This section motivates the need for using eXplainable Artificial Intelligence (XAI) for GenAI (Table 3), discusses its challenges (Table 4), and describes the important desiderata that explanations should fulfill (Table 5).

3.1 Importance of XAI for GenAI

Need to adjust outputs The need for explainability in GenAI increases as humans control and tailor the generated outputs. Generative AI blurs the boundary between users and developers. It can be instructed to solve tasks using auxiliary knowledge (Lewis et al. 2020), as demonstrated by OpenAI’s GPT-4 store (OpenAI 2023a), which allows ordinary users without programming skills to build and offer applications leveraging uploaded knowledge and GPT-4. A new skill of prompt engineering has emerged, which users need to master (Zamfirescu-Pereira et al. 2023). Experts have identified XAI as a key requirement to support prompt engineering (Mishra et al. 2023). Users need to better understand how to control outputs, handle limitations, and mitigate risks (Weidinger et al. 2022). Therefore, stakeholders must understand GenAI to create solutions aligned with their preferences.

Need for output verification GenAI models, particularly LLMs, are known for many shortcomings, ranging from generating harmful, toxic responses to misleading, incorrect content (Weidinger et al. 2022). LLMs can be said to be "fluent but non-factual," making their verification even more challenging. Their outputs are generally untrustworthy and require some form of verification or validation (as acknowledged by regulators (European Union 2023)). Explanations provide a mechanism to identify errors, such as hallucinations, in outputs.

Increased reach GenAI is easy to access and use through a web interface, facilitating widespread adoption. ChatGPT was the fastest product to reach 100 million users and continues to grow rapidly (Porter 2023). It is used throughout society, including by vulnerable groups such as schoolchildren, the elderly with limited IT knowledge, and corporate employees.

High-impact applications GenAI is a general-purpose technology with applications that may have severe immediate and long-term impacts. Users might seek advice for pressing personal problems (Shahsavar and Choudhury 2023) or use GenAI for educational purposes. In an educational context, using ChatGPT once and receiving a slightly biased response (e.g., towards gender, race, or minority groups) might have limited impact, but receiving such responses over a prolonged period could profoundly affect future generations. Aside from this long-term view, users might turn to GenAI like ChatGPT while in psychological distress and seeking immediate advice. ChatGPT is even known to outperform humans in emotional awareness (Elyoseph et al. 2023). Given such high-stakes applications, understanding GenAI becomes crucial.

Unknown applications Since generative AI can process any text, image, or other medium as input and output, it is difficult to anticipate all possible applications. Therefore, understanding the models more holistically to ensure they align with higher-level principles becomes increasingly important.

Difficult to evaluate automatically Simple strategies, such as counting the number of correct and incorrect answers (accuracy), provide only a limited picture of GenAI’s behavior, as many tasks yield responses that are difficult to score. For example, in summarization, human ratings of gold-standard references can be worse than those of model outputs, highlighting the challenge of designing benchmarks (Sottana et al. 2023). Therefore, a thorough, systematic quantitative evaluation using classical input and output test datasets is difficult, increasing the demand for better model understanding to anticipate potential shortcomings, as automated tests are challenging and possibly insufficient.

Security and safety concerns Various safety and security problems exist with GenAI-based technologies, as seen with adversarial examples (Goodfellow et al. 2015). GenAI enables novel forms of attacks and abuse (Gupta et al. 2023), such as social engineering on a massive scale that targets large numbers of people, e.g., to manipulate elections. Additionally, individuals might leverage GenAI for malicious activities, such as obtaining detailed instructions for performing terrorist attacks.

Accountability and legal concerns The need for accountability arises from multiple questions, often driven by legal concerns. GenAI has a complex supply chain of data providers and developers, meaning data might come from many sources, and multiple companies might be involved in building a GenAI system. These include those who create a foundation model (Schneider et al. 2024c), those who fine-tune it for specific tasks, and ultimately, users who adjust it through prompting. This makes accountability a challenging but necessary task, as GenAI systems have the potential to cause harm. Thus, the question of who is responsible and what causes harm becomes more relevant. This can lead to forensic questions like “Why did an AI trigger a certain action (Schneider and Breitinger 2023)? Was it the training, data, or model?” Even if it can be tied to one aspect, such as data, further questions arise, like, "Was it data from a third-party provider, public data, or data from users?" The need for accountability also arises from the use of copyrighted or potentially patented material. In lawsuits, a key question is whether a patent is valid because it is highly original. If GenAI can explain that the solution is not "a copy from an existing patent" but emerges through basic reasoning given "prior art," the judge might be more inclined to rule the patent invalid.

Table 3 Why is explainability important for GenAI?
Table 4 Why is XAI more challenging for GenAI?

3.2 Why is XAI more challenging for GenAI?

Lack of access Commercial GenAI models from large corporations such as OpenAI and Google are among the most widely used systems. However, users, researchers, and other organizations interested in understanding the generative process of specific artifacts or models cannot access the model internals and training data of these models. This limitation rules out many XAI approaches.

Interactivity Some tasks involving humans and GenAI are inherently interactive, such as negotiations between GenAI and humans (Schneider et al. 2023b). Therefore, explanations need to focus not just on the model but also on how the model impacts humans and vice versa throughout the interaction.

Complex systems, models, data, and training Understanding AI becomes increasingly difficult as models grow and process more training data. GenAI, based on very large foundation models, constitutes the largest AI models today, with hundreds of billions of parameters (Schneider et al. 2024c). These models also lead to novel, more complex AI supply chains. Pre-GenAI models were often built using company-internal data, possibly by fine-tuning a model trained on a moderate-sized public dataset such as ImageNet. GenAI systems are built using much larger datasets from various sources, including public and third-party providers. Often, a foundation model is further adjusted through fine-tuning or by retrieving information from external data sources (Lewis et al. 2020) to create a GenAI system. Thus, the final output of a GenAI system might be generated not only through a deep learning model but also by engaging with other tools such as code interpreters (Schick et al. 2024).

Complex outputs Generated artifacts are complex, typically consisting of thousands to millions of bits in an information-theoretic sense. Classical supervised learning models often produce just a single bit for binary decisions and at most a few dozen bits for other classifications, i.e., mostly less than a few million classes. Before GenAI, a classifier’s label constituted a single decision, while GenAI makes a multitude of decisions: one for each aspect of the artifact. Thus, textual outputs and images naturally lead to investigations of several aspects of the output, such as tone, style, or semantics (Yin and Neubig 2022). There are many possible questions about why an artifact exhibits a certain property.

Hard to evaluate explanations It is difficult to evaluate explanations, especially for function-grounded approaches, where GenAI itself is hard to evaluate. Function-grounded evaluation focuses on assessing an XAI method based on predefined benchmarks, for example, using an explanation to classify an object and determining whether the explanation matches the output of the model used to create it. As such benchmarks might not exist and are difficult to create for certain GenAI tasks, evaluating explanations becomes more challenging.

Diverse users A model might be utilized by a diverse set of users across all age and knowledge groups, covering many needs, similar to a search engine. In contrast, pre-GenAI systems are more often tailored to a specific user group and task.

Risk of ethical violations Even commercial GenAI models are known for producing potentially offensive, harmful, and biased content (Achiam et al. 2023). GenAI models also commonly self-explain, which raises the possibility that while the outputs might be ethical, the explanations could be offensive due to inadequate phrasing or visual depictions.

Technical shortcomings GenAI suffers from hallucinations and limited reasoning capability. Consequently, if GenAI models self-explain, these explanations are subject to the same shortcomings.

Table 5 Overview of novel and emerging desiderata for GenXAI


3.3 Desiderata of explanations for GenAI

There are several novel and emerging desiderata, as well as significant shifts in the relevance of established ones:

  • Verifiability has recently been discussed as an important aspect of explanations (Fok and Weld 2023). However, the issue that outputs of LLMs cannot be verified (due to hallucinations) was discussed years earlier (Maynez et al. 2020). If explanations cannot be verified and an explainee must trust them, the consequence can be the rejection of a correct answer due to an incorrect explanation or the failure to detect incorrect outputs. A key concern regarding verification is the effort required to verify. Efforts to understand and tailor explanations have been discussed for XAI in general, for example, in terms of time efficiency (Schwalbe and Finzel 2023).

  • Lineage ensures that model decisions can be traced back to their origins. It involves tracking and documenting data, algorithms, and processes throughout the AI model’s lifecycle. It is crucial for accountability, transparency, reproducibility, and, more generally, governance of artificial intelligence (Schneider et al. 2023a). It concerns the "who" and the "what," such as "Who provided the data or made the model?" and "What data or aspects thereof caused a decision?" While the latter is a well-known aspect of XAI, evidenced by sample-based XAI techniques, the former has not been significantly emphasized in the context of XAI. Faubel et al. (2023) established data traceability as an XAI-related requirement for Machine Learning Operations (MLOps) in industrial applications. The need for lineage arises as GenAI supply chains become more complex, often involving multiple companies (Schneider et al. 2024c) rather than just a single one. Additionally, multiple lawsuits have been filed in the context of generative AI, for example, related to copyright issues (Grynbaum and Mac 2023). Regulators have also imposed stringent demands on AI providers (European Union 2023). Therefore, employing GenAI poses legal risks to organizations. Ensuring lineage-supported accountability can serve as risk mitigation.

  • Interactivity and personalization have been previously discussed in XAI (Schwalbe and Finzel 2023; Schneider and Handali 2019). However, as GenAI outputs and systems become more complex, the number of options for how and what to explain has increased drastically. While there is only one bit to explain in a binary classification system, it amounts to millions for a generative AI system that generates images. Thus, it is nearly impossible for a user to understand the reasons behind all possible details of an output. Depending on user preferences and the purpose of the explanation, some aspects might be more relevant to understand than others. This requires the user to engage with the system to obtain explanations and the system to provide adequate explanations tailored to the user’s demands. As discussed in Sect. 4.2.3, systems supporting interactive XAI are increasingly emerging.

  • Dynamic explanations aim to automatically choose the explanation qualities and content (e.g., what output properties are explained) based on the sample, explanation objective, and possibly other information. For example, an explanation might elaborate on why a positive or negative tone was chosen for a generated text. Tonality might be discussed for some samples but not for others. When to include it should align with a specified objective, e.g., the explanation should satisfy the explainee (plausibility), explain the most surprising aspects of the generated artifacts, or explain the most impactful properties of the artifact on its consumers. The goal of dynamic explanations is to maximize the specified objectives while being free to choose the explanation content, including meta-characteristics such as what aspects are included in the explanation, its structure, etc.

  • Costs related to XAI might also become an emerging concern. The economics of XAI have been elaborated in (Beaudouin et al. 2020), where concerns about the costs of implementing XAI (and transparency) are mentioned. For corporations, GenXAI adds the risk of leaking value. Competitors (or academics) might prompt a model to generate training data for their own models, saving the costs of creating such data. For example, the Alpaca model (Taori et al. 2023) was trained on data extracted from one of OpenAI’s models, allowing the avoidance of the costly and time-consuming task of collecting data from humans. Using explanations, such as part of chain-of-thought (CoT) (Wei et al. 2022), can further enhance performance and thus add value.

  • Alignment criteria, such as helpfulness, honesty, and harmlessness (Askell et al. 2021), which are relevant for GenAI in general, also play a role for XAI. Some of these criteria partially overlap with existing ones, such as plausibility (with helpfulness) and faithfulness (with honesty). For instance, explanations should not be deceptive (Schneider et al. 2023c). Aspects such as harmlessness have received less attention, as explanations were commonly simpler, such as attribution-based explanations. Harmlessness implies that explanations do not contain offensive information or information about potentially dangerous activities (Askell et al. 2021). It also relates to security, discussed next:

  • Security is increasingly evolving as a desideratum. Explanations should not jeopardize the security of the user or the organizations operating the GenAI model. Providing insights into the reasoning process might facilitate attacks or be leveraged in competitive situations, leading to poorer outcomes for GenAI. For example, in a recent study on price negotiations between humans and LLMs (Schneider et al. 2023b), humans asked an LLM what decision criteria it used and then systematically exploited this knowledge to achieve better outcomes against the LLM. A human negotiator might not disclose such information. Thus, openness can be abused. “Security through obscurity” is one protection mechanism against attacks. Corporations might also aim to protect their intellectual property. For example, customer support employees (and GenAI models) might have access to some relevant information about a product to help customers but might not be allowed to obtain explanations on how the product is manufactured, as this might constitute a valuable company secret.

  • Uncertainty Understanding the confidence of outputs is an important aspect of XAI. While most works aim to explain a decision, explaining the uncertainty of a prediction has also garnered attention (Molnar 2020; Gawlikowski et al. 2023). Deep learning models, such as image classifiers, are known to be overconfident, and multiple attempts have been made to address this issue (Meronen et al. 2024; Gawlikowski et al. 2023). LLMs arguably take this a step further, as they often generate answers eloquently even if they are wrong, i.e., they are "fluent but non-factual." Still, LLMs have some (though not perfect) understanding of whether they can answer a question (Kadavath et al. 2022). They can also be enhanced with uncertainty estimation techniques (Huang et al. 2023c).

Prior research has extensively discussed the principles and desiderata of explanations. Here, we briefly summarize the key characteristics explanations should exhibit, drawing on prior surveys (Lyu et al. 2024; Schneider and Handali 2019; Schwalbe and Finzel 2023; Bodria et al. 2023; Guidotti 2022), and discuss these characteristics in the context of GenAI. Among the most important and well-known desiderata are:

  • Faithfulness (= fidelity = reliability) An explanation should accurately reflect the reasoning process of a model. Due to the complexity of GenAI models, a higher level of abstraction seems necessary to keep explanations comprehensible within a limited amount of time.

  • Plausibility (= persuasiveness = understandability) An explanation should be understandable and compelling to the target audience. Textual explanations, especially self-explanations, are often easier to understand compared to classical explanations such as SHAP values.

  • Completeness (= coverage) and minimality: An explanation should contain all relevant factors for a prediction (completeness) but no more (minimality). Complete coverage becomes less feasible with the growth of output, data, and model sizes. Personalized, interactive explanations that allow users to control explanations and XAI techniques that automatically select only interesting properties to be explained could be the way forward.

  • Complexity The total amount of conveyed information in an explanation, typically measured relative to the explainee’s knowledge, i.e., subjectively. This aspect gains importance as stakeholders become more diverse.

  • Input and model sensitivity and robustness Changes in the input (or model) that impact model outputs should also lead to changes in explanations (sensitivity). However, if changes in inputs do not alter model behavior, or changes in the model do not significantly alter processing and outputs, then explanations should not change disproportionately (robustness). This still holds for GenAI.

4 Taxonomy of XAI techniques for GenAI

Our taxonomy provides a scheme for classifying XAI mechanisms and algorithms that support the understanding of GenAI (Sect. 4.1), which we use to classify existing techniques (Sect. 4.4).

4.1 Dimensions of taxonomy

The key characteristics of our taxonomy are summarized in Table 6. We distinguish between the output (i.e., explanation) properties, input, and internal properties of GenXAI algorithms. Explanation properties characterize the outputs of XAI algorithms, i.e., the explanations, in terms of scope (what fraction of samples, attributes, and parts of the interaction they explain), modality (unimodal or multi-modal), and interactivity (whether the user can engage in obtaining additional explanations or tailor explanations). Input and internal properties relate to what the XAI algorithms require to produce explanations and how these explanations are obtained. While many ideas on structuring XAI are still valid in the context of GenAI, we focus primarily on novel dimensions such as the foundational source for XAI, which can be data, model, training, or prompt. The source forms the key mechanism or artifact leveraged by XAI techniques to generate explanations, as elaborated in Sect. 4.3.1. One might classify the first three sources (data, model, and training) under the category of intrinsic methods, as they impact the resulting model. However, data can also be extrinsic, particularly in Retrieval-Augmented Generation (RAG). Similarly, for prompts, Chain-of-Thought (CoT) encourages and guides XAI but also relies to some extent on training. If training data lacks any form of explanation, CoT prompting will not work. This is most evident in extreme cases where words like "because" are removed from the training data, implying that they will never be generated.

Table 6 Dimensions of our taxonomy for GenXAI algorithms

4.2 Explanation properties

We elaborate on the dimensions related to outputs of XAI methods, i.e., explanations.

4.2.1 Scope

We discuss scope in terms of output, interaction and input scope. Often, scope (Schwalbe and Finzel 2023) only refers to what inputs are explained by a method, i.e., a single sample (local) versus all samples (global), meaning the model behavior in general (Guidotti et al. 2019; Bodria et al. 2023). Thus, the scope indicates the quantity of the input samples explained, which we call input scope. For GenAI, scope also refers to the quantity of the output that is explained, i.e., a single attribute of the output (focused) versus all attributes (holistic). We call this output scope. This is illustrated with an example in Fig. 4. As outputs are significantly more complex for GenAI, they present more options for questions. For example, why did the response contain certain information and not other information? Why was the sentiment of a generated sentence positive, neutral, or negative? While some of our methods touch on these questions, there is limited work overall in this direction. Similarly, there is limited understanding of how to relate training data to predictions beyond classical approaches such as influence functions, i.e., how a specific piece of knowledge in the training data impacted the generation process.

Fig. 4 Illustration of input and output scope, where we use “tonality” as an example of a text-related attribute. Traditionally, scope referred only to input scope

Interaction scope: For GenAI, interactions commonly take the form of dialogues, making it more relevant to understand both sides of the interaction. Therefore, we distinguish two goals:

  • Explaining (single) input–output relations: This is the classical notion, where an AI system processes one input to produce one output. The goal can be to understand a particular instance (local explanation) or the model as a whole. While explanations might consider user-specific aspects and personalize explanations (Schneider and Handali 2019), such personalization is typically independent of the interaction.

  • Explaining the entire interaction: This focuses on human-AI interaction and its dynamics, for example, the communication and actions between an AI and a human when the human solves a task using the AI. Interactions are characterized by multiple rounds of outputs generated by both sides conditioned on prior outputs. In this case, the goal is to more holistically explain (i) the dynamics of the interaction, i.e., not just a single input–output pair but the entire sequence of in- and outputs, and (ii) the outcome of the interaction, which could be why a particular artifact such as an image was generated in a certain way, but also why a user did not complete a task, for example, abandoned it prematurely or achieved an unsatisfactory output. Interaction dynamics are influenced by a series of technical factors (such as model behavior, including classical performance measures but also latency, the user interface, etc.) and non-technical factors (such as human attitudes and policies). As such, human-AI interaction cannot easily be associated with one scientific field but is inherently interdisciplinary. Explainability, which aims at understanding AI technology, should focus on how technical factors related to model behavior impact the interaction. While many existing works touch on the subject, the change in interactivity brought along by prompting due to GenAI is not well understood. One study investigated interactivity in negotiations (Schneider et al. 2023b); however, the explanations for negotiation outcomes and interaction behavior were obtained not through algorithms but through a manual, qualitative investigation of the interaction, a common technique in the social sciences but less prevalent in computer science.

The two goals are illustrated with an example in Fig. 5.

Fig. 5 Illustration of input and interaction scope, where we use “tonality” as an example of a text-related attribute. Traditionally, scope referred only to input scope

The field of human-computer interaction has aimed at explaining human interactions for a significant amount of time (MacKenzie 2024; Carroll and Olson 1988), with some effort also devoted to discussing human-AI interaction. For example, guidelines for human-AI interaction (before GenAI) are well-studied (Amershi et al. 2019), as is the topic of explanation in collaborative human-AI systems (Theis et al. 2023). Furthermore, explanations in human-AI systems typically encompass objectives driven by non-technical concerns (Miller 2019; Mueller et al. 2021). However, there is less work on explaining human-AI interactions themselves (Sreedharan et al. 2022), particularly targeted towards GenAI. To provide two examples from the pre-GenAI era spanning from in-depth technical studies to broader organizational studies: Schneider (2022) discussed how human-AI interaction could be optimized, accounting for long-term goals such as preserving human diversity. The work explains how the user can improve their interaction to reduce error rates and become more efficient. Grisold and Schneider (2023) investigated the dynamics of AI adoption within an organization, explaining how error rates and the learning behavior of an AI impact the complexity of processes. Explanations in interactive systems in pre-GenAI (Rago et al. 2021) and GenAI (see Sect. 4.2.3) are typically more concerned with supporting interactive explorations of a decision, for instance, querying a system to better understand it, rather than obtaining explanations on how a sequence of inputs and outputs emerged.

4.2.2 Explanation modality

Commonly, explanations are unimodal, such as textual, visual, or numeric. However, multi-modal explanations have also been investigated (Park et al. 2018) by collecting a dataset containing textual and visual justifications. This research shows that multi-modal explanations yield favorable outcomes, with each modality improving due to the presence of the other.

4.2.3 Dynamics

Interactivity In a classical setting, XAI is often non-interactive, meaning the explainee has limited options to control explanations or request additional information. However, the idea of interactivity in XAI has existed for some time, as shown in prior taxonomies (Schwalbe and Finzel 2023). One idea is to reconceptualize XAI as a dialogue (Singh et al. 2024). Slack et al. (2023) supports the explainability of a machine learning model by using an LLM to translate natural language queries into a predefined set of operations related to explainability, such as identifying the most important features or changing a feature. Gao et al. (2023) uses LLMs to create an interactive and explainable recommender system. In terms of classical metrics such as recall and precision, the system does not outperform existing approaches. Improving models based on human explanations in an interactive setting has also been studied for image recognition (Schramowski et al. 2020). While LLMs are known for their superior performance in many Natural Language Processing (NLP) tasks, they might also be employed without improving classical metrics such as accuracy, precision, or recall, but rather to provide other benefits, such as explainable, interactive systems requiring less training data (Gao et al. 2023; Deldjoo 2023). Deldjoo (2023) showed that, through explanation-guided prompts, binary risk classification can roughly match the performance of a classical machine learning system with 40x less data.

Static vs. dynamic explanation qualities and content relates to whether the structure and content of an explanation vary with the sample. This includes which properties of the generated artifact are explained and how they are explained. This choice can be independent of the sample (static), i.e., identical for each sample, or dependent on the sample itself (dynamic). For example, classical methods like attribution maps are static, as they only explain relevance and always assign a relevance value to each input pixel. In contrast, textual self-explanations can be dynamic. The explained properties can be selective, e.g., for a generated image, an explanation like "A moving, red car was shown to make the image more vivid and engage the viewer" only describes one part of the image and one property of that part. Compare this to an explanation for a generated image of a mountain landscape: "A snowy landscape was chosen as the prompt contained the word ’bright.’" The two exemplary explanations refer to different objects (car vs. landscape), differ in the explained properties (explaining emotional intention vs. not discussing it), and differ in the causal factors discussed (mentioning which inputs caused output properties, i.e., "bright" in the prompt, vs. not doing so); such variation with the sample is what makes explanations dynamic.

4.3 Input and internal properties

We elaborate on the dimensions related to the inputs and internals of XAI methods (in contrast to their outputs).

4.3.1 Foundational sources for XAI techniques

We consider the data, model, optimization (training), and prompt as the foundational sources for XAI methods. That is, each source can be modified or tailored to improve XAI.

Model-induced XAI refers to intrinsic XAI methods (also known as model-specific XAI (Guidotti et al. 2019)), where the design of the model is altered to foster explainability. With GenAI, novel intrinsic methods have emerged.

Using interpretable components Deep learning models comprise multiple layers and components, such as activation functions and attention mechanisms. These components can be more or less complex, affecting the overall explainability of the model. For example, interpretable activation functions might replace conventional, less interpretable functions. SoLU (Elhage et al. 2022) is said to enhance interpretability. It is motivated by the idea that a layer can represent more features than it has neurons, encoding them in superposition (Olah et al. 2020), and the paper provides evidence supporting this hypothesis. SoLU makes some neurons more understandable but hides others; thus, while the method claims to provide a net benefit overall, it is not without costs. Classical attention layers are also commonly used, though their value for XAI is debated (see Sect. 4.4.1).
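As a minimal sketch (in NumPy), SoLU can be written as the element-wise product of the pre-activations with their softmax, i.e., x · softmax(x); the LayerNorm applied after SoLU in the original setup is omitted here for brevity.

```python
import numpy as np

def solu(x: np.ndarray) -> np.ndarray:
    """SoLU activation: x * softmax(x) over the feature axis (Elhage et al. 2022).

    The softmax factor amplifies the largest pre-activations and suppresses the
    rest, which is argued to make individual neurons easier to interpret.
    """
    e = np.exp(x - x.max(axis=-1, keepdims=True))  # numerically stable softmax
    return x * (e / e.sum(axis=-1, keepdims=True))

# A strongly activated feature dominates; weaker ones are pushed towards zero.
print(solu(np.array([4.0, 1.0, 0.5, -2.0])))
```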

Interpretable GenAI models through additional models Model interpretability can be achieved by combining LLMs with other models and often with external knowledge. For example, in (Chen et al. 2023c), an LLM is enhanced with a GNN and external knowledge to generate an explanation and prediction jointly. Creswell and Shanahan (2022) fostered explicit multistep reasoning by chaining responses of two fine-tuned LLMs; one performs selection, the other inference. Wang and Shu (2023) incorporates external knowledge using classical internet search (as done in commercial products such as Bing Chat). Additionally, it uses first-order logic to create easier-to-verify subclaims that jointly lead to the overall claim. It also generates explanations by querying the LLM. Vedula et al. (2023) developed and trained a decoder for faithful and explainable online shopping product comparisons.

Optimization Adjusting the optimization objective is a common technique to foster explainability. Common strategies involve disentangling latent dimensions (Ross et al. 2017) and training for XAI-relevant criteria, such as citing evidence (Menick et al. 2022). In the context of GANs, Chen et al. (2016) explicitly trained disentangled representations by maximizing the mutual information between a small subset of the latent variables and the sample. This approach allowed the so-called InfoGAN to separate writing styles from digit shapes on the MNIST dataset. Furthermore, in (Ross et al. 2017), explanations have been used to constrain training, aiming "to be right for the right reasons." Note that while disentanglement is commonly achieved through training objectives and regularization, special network architectures like capsule networks (Sabour et al. 2017) and others (see the section on Disentanglement in (Schwalbe and Finzel 2023)) can also support disentanglement. Diffusion models already have a semantic latent space (Kwon et al. 2022). A special reverse process can leverage this for image editing using CLIP (Ramesh et al. 2022), which iteratively improves reconstructed images. (Menick et al. 2022) trains multiple models to cite evidence for claims, facilitating answer understanding and, especially, fact-checking.
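Concretely, InfoGAN augments the standard GAN minimax objective \(V(D,G)\) with a variational lower bound \(L_I(G,Q)\) on the mutual information \(I(c;G(z,c))\) between a subset of latent codes c and the generated sample, optimizing \(\min _{G,Q}\max _D V(D,G)-\lambda L_I(G,Q)\), where the auxiliary network Q approximates the posterior over c and the hyperparameter \(\lambda \) balances sample realism against disentanglement.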

Training data can support XAI in multiple ways. While the exact impact of training data and dynamics is not yet fully explored (Teehan et al. 2022), current findings indicate that training data is a crucial factor in supporting explainability, particularly for textual data.

Training data composition: GenAI training data is usually complex, even for a single modality. For example, text data might consist of programming code, books, dialogues, etc. The composition of the training data can impact reasoning, e.g., code can improve certain reasoning tasks (Ma et al. 2023). While less is known explicitly about the impact on explanations, the overall data composition is also likely to impact XAI.

Explanation quality and quantity Aside from the overall composition of training data, the presence and absence of explanations in the training data also impact XAI. While it is well-known that the quality of training data strongly impacts a model’s performance, the impact on XAI has received less attention. However, as GenAI can self-explain, i.e., generate explanations as part of its outputs, as seen with CoT (Wei et al. 2022), the training data also strongly impacts the quality of these explanations. For illustration, assume that a model is trained with erroneous explanations but potentially correct results; it could perform well on tasks but provide poor explanations. Additionally, if the training data does not contain any data explaining the reasoning, the model might perform worse at explanations. In an extreme case, if the training data does not contain explanations, including explanatory words such as "due to," "for that reason," and "because," the model will also never generate such words (and the corresponding explanations).

Domain specialization (Ling et al. 2023) can tailor the model more toward a specific domain, potentially at the cost of abilities in other domains (Chen et al. 2020). It can also contribute towards explainability, as a narrower model focus implies fewer potential options for explanations and, thus, a lower risk of errors and easier verifiability.

Prompts can also induce explanations: LLMs can be prompted (i) to provide explanations in a preferred manner or (ii) be constrained to rely on a limited set of given facts to create responses, facilitating verification. The idea of "rationalization" dates back to the early 2000s (Zaidan et al. 2007). One type of explainability is “justifying a model’s output by providing a natural language explanation” (Gurrapu et al. 2023). Commonly, explanations clarifying the reasoning process are elicited through chain-of-thought (CoT) prompting (Wei et al. 2022). Such prompting allows structuring the reasoning process, implicitly shaping explanations, and utilizing external knowledge for each reasoning step to yield more faithful explanations (Wei et al. 2022; He et al. 2022). However, explanations can also contain hallucinations, as shown in the context of few-shot prompting (Ye and Durrett 2022), and CoT prompting is sensitive to inputs (Turpin et al. 2024). Despite their unreliability (due to a lack of faithfulness in explaining the true inner workings), such explanations might still be valuable for verifying output correctness or in training (smaller) models (Zhou et al. 2023). Additionally, LLMs can be constrained to rely on information from the user-given prompt or extracted from an external database or the internet, a process known as Retrieval-Augmented Generation (RAG) (Lewis et al. 2020). RAG facilitates understanding and verifying the response of an LLM, as the source of information for the answer is known and typically much smaller than the entire training data of a GenAI model. To facilitate explainability for recommender systems, personalized prompt learning, such as soft-prompt tuning to yield vector IDs, has been conducted (Li et al. 2023a).
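As a minimal sketch of the RAG pattern (the embedding function and document store below are illustrative placeholders, not a particular library's API), the retrieved passages both ground the answer and serve as evidence the user can inspect to verify it:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder encoder producing a vector embedding (illustrative stub)."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(8)

documents = ["GenXAI denotes explainable AI for generative models.",
             "Diffusion models generate images by iteratively denoising."]
doc_vectors = np.stack([embed(d) for d in documents])

def rag_prompt(question: str, k: int = 1) -> str:
    """Retrieve the k most similar documents and build a grounded prompt."""
    q = embed(question)
    sims = doc_vectors @ q / (np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q))
    top = [documents[i] for i in np.argsort(-sims)[:k]]
    sources = "\n".join(f"[{i + 1}] {d}" for i, d in enumerate(top))
    return ("Answer based only on the following sources and cite them:\n"
            f"{sources}\n\nQuestion: {question}")

print(rag_prompt("What is GenXAI?"))
```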

4.3.2 Required model access by XAI method

Depending on what is to be explained and the XAI method used, different information is needed to obtain explanations (Schwalbe and Finzel 2023). Black-box access provides no information beyond the predicted output, and even output probabilities might not be available; internal information, such as activations and gradients, is commonly unavailable. White-box access refers to having complete access to the model, its training data, and its training procedure. In between, there is a wide range of grey-box access (Schneider and Breitinger 2023). An important restriction for XAI techniques investigating commercial models is that these models are typically only accessible as a black box, such as GPT-4 through an API. Similarly, commercial vendors do not share their training data and often do not even disclose a basic summary of the training dataset. Thus, XAI techniques leveraging training data are not easily employable.

4.3.3 Model (self-)explainers

GenAI models, particularly LLMs, can provide explanations for their own decisions (self-explain) or serve as explainers in general. That is, the model itself provides explanations rather than relying on a dedicated XAI technique. This contrasts with the classical notion of intrinsic XAI, which often denotes understandable, simple models such as decision trees and linear regression. While self-explanations are not necessarily faithful (Turpin et al. 2024), attempts have been made to improve their accuracy. For example, Chuang et al. (2024) proposed using an evaluator to quantify faithfulness and iteratively optimizing the faithfulness scores. Though self-explanations are not necessarily accurate or faithful, they can be helpful, as demonstrated in a complex environment where agents performed multistep planning and improved through self-explanation (Wang et al. 2023). LLMs can serve as explainers by providing explanations for essentially anything. For instance, they can self-explain by generating explanations tailored to their outputs (Wei et al. 2022; Turpin et al. 2024), support explaining other machine learning models (OpenAI 2023b; Slack et al. 2023; Singh et al. 2023), provide explanations by analyzing patterns in data through autoprompting (Singh et al. 2022), support people in self-diagnosis (Shahsavar and Choudhury 2023), or yield interpretable autonomous driving systems (Mao et al. 2023). In fact, on free-form coding tasks (generation), LLMs produce explanations that often exceed the quality of crowd workers’ gold references (Ziems et al. 2023). For mental health analysis, the quality of LLM explanations approaches that of human explanations (Yang et al. 2023a).

4.3.4 Explanation sample difficulty

Not all input samples are equally challenging to explain (Saha et al. 2022). The idea that some samples and interactions are easier to explain than others gains relevance as the variance in possible inputs and outputs increases. For example, LLMs allow users to pose anything from the simplest to the most complex questions, and LLM-generated texts might range from a "lookup" of a fact learned from the training data to long stories or solutions to complex tasks. Explanations (as judged by humans) might be poorer for difficult samples than for simple ones. More precisely, this has been observed when explaining data labels with GPT-3, where GPT-3 explanations degraded much more with example difficulty than human explanations did (Saha et al. 2022). Generally, the idea of distinguishing XAI tasks based on their difficulty has received little attention so far. From a computational perspective, current explanation algorithms require the same amount of computation regardless of difficulty. Forward computations, too, typically proceed identically regardless of sample difficulty, though the notion of reflection (and thinking fast and slow) has been discussed in the literature (Schneider and Vlachos 2023b). LLMs can benefit from ensemble methods, such as combining multiple outputs into a single output (Huang et al. 2023b). There are numerous works on self-correction (Pan et al. 2023). For example, self-debugging using self-generated explanations as feedback has been linked to reduced coding errors by LLMs (Chen et al. 2023b); a minimal sketch follows below. However, the ability of LLMs to self-correct reasoning by "reflecting" on their responses without additional information has also been questioned (Huang et al. 2023a). While state-of-the-art systems like GPT-4 improve scores on causal reasoning benchmarks, which can serve as explanations, they also exhibit unpredictable failure modes (Kıcıman et al. 2023).
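
A minimal sketch of such a self-correction loop in the spirit of self-debugging is shown below; `generate` and `run_tests` are hypothetical helpers standing in for an LLM call and a test harness, respectively, and the prompts are illustrative.

```python
# Sketch of a self-correction loop in the spirit of self-debugging
# (Chen et al. 2023b); `generate` and `run_tests` are hypothetical helpers.
from typing import Optional

def generate(prompt: str) -> str:
    """Placeholder for any LLM completion call."""
    raise NotImplementedError

def run_tests(code: str) -> Optional[str]:
    """Return an error message if the candidate code fails its tests, else None."""
    raise NotImplementedError

def self_debug(task: str, max_rounds: int = 3) -> str:
    code = generate(f"Write Python code for the following task:\n{task}")
    for _ in range(max_rounds):
        error = run_tests(code)
        if error is None:
            return code
        # The model explains its own code and the observed failure ...
        explanation = generate(
            "Explain what this code does and why it produces the error below.\n"
            f"Code:\n{code}\nError:\n{error}"
        )
        # ... and the explanation is fed back as the signal for revision.
        code = generate(
            "Revise the code using this explanation as feedback.\n"
            f"Explanation:\n{explanation}\nOriginal code:\n{code}"
        )
    return code
```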

4.3.5 Dimensions of Pre-GenAI

There is also a long list of concepts relevant to our taxonomy that originate from pre-GenAI (Adadi and Berrada 2018; Zini and Awad 2022; Dwivedi et al. 2023; Schwalbe and Finzel 2023; Räuker et al. 2023; Saeed and Omlin 2023; Speith 2022; Minh et al. 2022), which we will not elaborate on in detail. For example, XAI methods can be classified according to what is to be understood, which includes the system, model-related information such as representations (layers, vectors, embeddings) and outputs, training dynamics, and the impact of data. This distinction was already made before GenAI (Schwalbe and Finzel 2023). The most common ways to structure XAI techniques in prior works are, unfortunately, not conceptually clean. For example, a common distinction is between mechanistic and feature attribution-based techniques. However, conceptually, feature attribution relates to how the explanation looks (i.e., relevance scores for output). Meanwhile, mechanistic interpretability aims more at what is being investigated (i.e., neurons and interactions) and how the techniques work (through reverse engineering). As the names of the existing categories are well-established and thus easy for readers to comprehend, we shall use them in our classification of techniques shown in the next section while also discussing classification based on our novel dimensions shown in Table 6.

4.4 Classification of techniques

We categorize XAI techniques into four groups commonly found in existing literature (see prior surveys in Sect. 1). Figure 6 provides an overview of these techniques. Our focus is on techniques developed specifically for GenAI models or classical techniques that have been adjusted for GenAI (mostly by addressing computational issues) or could be employed without significant changes. We also structure existing techniques in terms of our novel dimensions (Sect. 4.4.5).

Fig. 6 Overview of categories of techniques with illustrative examples (Figures are from cited references)

4.4.1 Feature attribution

Feature attribution assigns a relevance score to each input feature, such as a word or pixel.

Perturbation-based techniques partially alter inputs, for example by removing or changing features, and investigate the resulting output changes. In the context of NLP, alterations such as removing tokens (Wu et al. 2020) as well as negating and intensifying statements (Li et al. 2016) have been explored.
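
As an illustration, a minimal occlusion-style attribution might look as follows: each word is removed in turn and the drop in the probability of the originally predicted class is recorded. The sentiment model named here is only an example; any classifier could be substituted.

```python
# Occlusion-style attribution sketch: remove each word in turn and record the
# drop in the probability of the originally predicted class (model illustrative).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = "distilbert-base-uncased-finetuned-sst-2-english"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name).eval()

def probs(text: str) -> torch.Tensor:
    with torch.no_grad():
        return torch.softmax(model(**tok(text, return_tensors="pt")).logits[0], dim=-1)

text = "The movie was surprisingly good"
words = text.split()
base = probs(text)
pred = base.argmax().item()

# Relevance of a word = probability drop for the predicted class when it is removed.
scores = {
    w: (base[pred] - probs(" ".join(words[:i] + words[i + 1:]))[pred]).item()
    for i, w in enumerate(words)
}
print(scores)
```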

Gradient-based methods require a backward pass from outputs to inputs to obtain derivatives. While not all gradient-based techniques work reliably (Adebayo et al. 2018; Ghorbani et al. 2019), some, like Grad-CAM (Selvaraju et al. 2017), which computes a function of gradients and activations, have proven valuable at the pixel level in images and for token-level attribution (Mohebbi et al. 2021). Directional gradients have also been used in NLP models (Sikdar et al. 2021; Enguehard 2023). Integrated gradients have been used to attribute knowledge to internal neurons (Lundstrom et al. 2022; Dai et al. 2022), as discussed under "Neuron activation explanation." Simple first derivatives with respect to embedding dimensions are used in Li et al. (2016).
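
A simple gradient-times-input sketch on token embeddings, a basic relative of first-derivative saliency, could look as follows; the model name is illustrative, and this is not a faithful reimplementation of any cited method.

```python
# Gradient-times-input sketch on token embeddings (model name illustrative).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = "distilbert-base-uncased-finetuned-sst-2-english"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name).eval()

enc = tok("The movie was surprisingly good", return_tensors="pt")
embeds = model.get_input_embeddings()(enc["input_ids"]).detach().requires_grad_(True)

logits = model(inputs_embeds=embeds, attention_mask=enc["attention_mask"]).logits[0]
pred = logits.argmax().item()
logits[pred].backward()  # backward pass from the predicted class to the embeddings

# Per-token relevance: gradient times embedding, summed over embedding dimensions.
relevance = (embeds.grad * embeds).sum(-1).squeeze(0)
for token, score in zip(tok.convert_ids_to_tokens(enc["input_ids"][0]), relevance):
    print(f"{token:>15s} {score.item():+.3f}")
```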

Surrogate models approximate large models using much simpler models, often to understand individual predictions. Classical methods include LIME (Ribeiro et al. 2016) and SHAP (Lundberg and Lee 2017), which have been adapted for transformers (see Kokalj et al. 2021 for SHAP). Furthermore, attention flows in NLP models have been shown to relate to SHAP values (Ethayarajh and Jurafsky 2021). Explain Any Concept (EAC) (Sun et al. 2024) presents an approach for concept explanation, utilizing the Segment Anything Model (SAM) (Kirillov et al. 2023) for initial segmentation and introducing a surrogate model to make the explanation process more efficient. SAM excels at producing object masks from input prompts such as partial masks, points, or boxes, can generate masks for all objects in an image, and was trained on a vast dataset of 11 million images and 1.1 billion masks.
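
A LIME-style surrogate can be sketched in a few lines: random word-masking perturbations are generated, a black-box predictor scores them, and a weighted linear model is fitted as the local surrogate. The `predict_proba` callable is an assumed interface to whatever black-box model is being explained, not part of any cited library.

```python
# LIME-style surrogate sketch: fit a linear model on random word-masking
# perturbations of one input. `predict_proba(texts)` is an assumed interface
# returning an array of shape (n_texts, n_classes).
import numpy as np
from sklearn.linear_model import Ridge

def lime_like(text, predict_proba, label, n_samples=500, seed=0):
    rng = np.random.default_rng(seed)
    words = text.split()
    masks = rng.integers(0, 2, size=(n_samples, len(words)))
    masks[0] = 1  # keep the unperturbed sample
    texts = [" ".join(w for w, keep in zip(words, m) if keep) or words[0] for m in masks]
    probs = predict_proba(texts)[:, label]
    # Weight perturbed samples by similarity to the original (fraction of words kept).
    weights = masks.mean(axis=1)
    surrogate = Ridge(alpha=1.0).fit(masks, probs, sample_weight=weights)
    # The surrogate's coefficients serve as word-level relevance scores.
    return dict(zip(words, surrogate.coef_))
```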

Decomposition-based methods Decomposition traditionally refers to attributing relevance from outputs to inputs or decomposing vectors, but in the context of GenAI, it can also refer to decomposing the reasoning process and attributing outputs to specific reasons. Liu et al. (2022) aim to explain the reasoning process for question answering using entailment trees constructed through reinforcement learning. An entailment tree has a hypothesis as its root, reasoning steps as intermediate nodes, and facts as leaves. Common decomposition techniques compute relevance scores layer by layer, so that contributions of upper layers emerge as a combination of lower-level contributions. A classic example is Layer-wise relevance propagation (LRP) (Montavon et al. 2019). Decomposition-based methods have also been applied to transformers (Ali et al. 2022). Ali et al. (2022) claimed that their adaptation of LRP mitigates shortcomings of gradient methods, which are said to arise due to layernorm and attention layers. Linear decomposition has also been suggested for local interpretation of transformers (Yang et al. 2023c), where a decomposition is considered interpretable if it is orthogonal and linear. Decompositions are often vector-based (Luo and Specia 2024); they express a vector (such as a token embedding) in terms of more elementary vectors. For example, Modarressi et al. (2023) decomposed token vectors and propagated them through the network while maintaining accurate attribution. Zini and Awad (2022) surveyed XAI methods for word embeddings. One idea is to express embedding vectors in terms of an orthogonal basis of interpretable concept vectors; another technique employs external knowledge and sparsification of dense vectors.
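
For intuition, the epsilon rule of LRP for a single linear layer can be sketched as follows; stacking this redistribution step layer by layer yields the full decomposition. This is a didactic sketch, not a drop-in replacement for the cited transformer adaptations.

```python
# Epsilon-rule LRP sketch for one linear layer, the building block that
# layer-by-layer decomposition methods such as LRP stack through a network.
import torch

def lrp_epsilon_linear(layer: torch.nn.Linear, a: torch.Tensor,
                       relevance_out: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Redistribute relevance from the layer's outputs to its inputs.

    a: input activations, shape (n_in,); relevance_out: shape (n_out,).
    Returns relevance over the inputs, shape (n_in,); relevance is conserved
    up to the epsilon stabilizer (bias handling is ignored in this sketch).
    """
    z = layer(a)                          # pre-activations, shape (n_out,)
    z = z + eps * torch.sign(z)           # stabilizer to avoid division by zero
    s = relevance_out / z                 # shape (n_out,)
    c = layer.weight.t() @ s              # back-project through the weights, (n_in,)
    return a * c

# Toy usage: relevance of 4 inputs for a single output neuron.
layer = torch.nn.Linear(4, 1)
a = torch.randn(4)
r_in = lrp_epsilon_linear(layer, a, relevance_out=layer(a).detach())
```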

Attention-based Attention is a key element within neural networks, providing importance scores for inputs, which are not necessarily the initial inputs to the network but those of a prior layer. For LLMs, attention scores are commonly obtained between all input token pairs for a single attention layer and can be visualized using a heatmap or bipartite graph (Vig 2019). Relevance scores have also been computed by combining attention information with gradients (Barkan et al. 2021). Attention-based methods have been scrutinized because they might not identify the most relevant features for predictions (Serrano and Smith 2019; Jain and Wallace 2019); however, the debate remains unsettled. Stremmel et al. (2022) focus on explaining language models for long texts by leveraging sparse attention and developing a masked sampling procedure to identify text blocks contributing to a prediction. Some of the listed techniques leverage several ideas; for instance, Modarressi et al. (2023) can be considered both a vector-based and a decomposition-based method.
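
Extracting such attention maps is straightforward with standard toolkits, as the following sketch shows (model name illustrative); the resulting per-head matrices over all token pairs are what heatmap and bipartite-graph visualizers render.

```python
# Sketch: extract per-head attention maps over all token pairs in one layer
# (model name illustrative).
import torch
from transformers import AutoTokenizer, AutoModel

name = "bert-base-uncased"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name, output_attentions=True).eval()

enc = tok("Attention provides importance scores", return_tensors="pt")
with torch.no_grad():
    out = model(**enc)

# out.attentions: tuple with one entry per layer, each (batch, heads, tokens, tokens).
layer, head = 0, 0
attn = out.attentions[layer][0, head]
tokens = tok.convert_ids_to_tokens(enc["input_ids"][0])
for i, t in enumerate(tokens):
    top = attn[i].argmax().item()
    print(f"{t:>12s} attends most to {tokens[top]}")
```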

4.4.2 Sample-based

Sample-based techniques investigate output changes for different inputs. In contrast to perturbation-based methods, which selectively change individual features to investigate their impact, sample-based techniques consider the sample as a whole: they relate various inputs to their corresponding outputs rather than attributing an output to specific features within a single input.

“Training data influence” measures the impact of a specific training sample on the model, typically on the output for a particular input. Grosse et al. (2023) addressed computational issues to employ influence functions for LLMs. Explainability has been transferred from a large natural language inference dataset to other tasks (Yordanov et al. 2021).
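
A heavily simplified proxy for training-data influence is sketched below: training examples are scored by the alignment of their loss gradients with the gradient of a test example. Full influence functions, as scaled to LLMs by Grosse et al. (2023), additionally involve inverse-Hessian-vector products, which are omitted here; `model` and `loss_fn` are assumed to be a torch model returning logits and a standard loss.

```python
# Simplified gradient-similarity proxy for training-data influence.
# `model` is any torch model whose forward returns logits; `loss_fn` is, e.g.,
# cross-entropy. This omits the inverse-Hessian term of true influence functions.
import torch

def grad_vector(model, loss_fn, x, y):
    model.zero_grad()
    loss_fn(model(x), y).backward()
    return torch.cat([p.grad.flatten() for p in model.parameters() if p.grad is not None])

def influence_proxy(model, loss_fn, train_examples, test_example):
    """Score each training example by how well its loss gradient aligns with the
    test example's loss gradient (higher = heuristically more influential)."""
    x_test, y_test = test_example
    g_test = grad_vector(model, loss_fn, x_test, y_test)
    return [torch.dot(grad_vector(model, loss_fn, x, y), g_test).item()
            for x, y in train_examples]
```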

Adversarial samples are inputs altered by small changes that are hard for humans to perceive yet lead to a change in outputs. They are typically discussed in the context of cybersecurity, where an attacker aims to alter model outputs without a human noticing the input change. However, schemes that aim to "trick" humans and classifiers alike have also been proposed (Schneider and Apruzzese 2023). For example, SemAttack (Wang et al. 2022a) perturbs embeddings of BERT tokens, while other attacks exchange words (Jin et al. 2020). Parts of inputs can also be occluded to better understand model behavior (Schneider and Vlachos 2023a).

Counterfactual explanations seek minimal changes to an input so that the output changes from a class y to a specific class \(y'\). In contrast to adversarial samples, the changes may be noticeable to humans. For example, GPT-2 has been fine-tuned to provide counterfactuals based on pairs of original and perturbed input sentences (Wu et al. 2021). Exploring LLM capabilities through counterfactual task variations, Wu et al. (2023) have shown that LLMs commonly rely on narrow, context-specific procedures that do not transfer well across tasks. Augustin et al. (2022) and Jeanneret et al. (2022) use diffusion models guided by classifiers to create counterfactual explanations.
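
A toy greedy search for counterfactuals over a black-box text classifier might look as follows; the `predict_proba` callable and the candidate-substitution lexicon are assumptions supplied by the caller rather than parts of any cited method.

```python
# Toy greedy counterfactual search over a black-box text classifier.
# `predict_proba(text)` returns class probabilities; `candidates` maps a word
# to possible substitutes (e.g., from a lexicon). Both are caller-supplied.
def counterfactual(text, predict_proba, target, candidates, max_edits=2):
    """Find a small word substitution that makes `target` the predicted class."""
    words = text.split()
    for _ in range(max_edits):
        if predict_proba(" ".join(words)).argmax() == target:
            return " ".join(words)  # minimal change found
        # Greedily pick the substitution that most increases the target class.
        scored = [
            (predict_proba(" ".join(words[:i] + [s] + words[i + 1:]))[target], i, s)
            for i, w in enumerate(words)
            for s in candidates.get(w, [])
        ]
        if not scored:
            return None
        _, i, s = max(scored)
        words[i] = s
    return " ".join(words) if predict_proba(" ".join(words)).argmax() == target else None
```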

Contrastive explanations explain why a model predicted y rather than \(y'\). Contrastive explanations are said to better disentangle different aspects (such as part of speech, tense, semantics) by analyzing why a model outputs one token instead of another (Yin and Neubig 2022).
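
The contrastive idea can be sketched by attributing the difference between the logits of the produced token and a foil token, for example via gradient-times-input on the embeddings; the model name and token pair are illustrative, and this is a simplified variant of the cited approach.

```python
# Contrastive sketch: why " she" rather than " he"? Attribute the difference in
# logits between the produced token and a foil token via gradient-times-input
# (model name and token pair illustrative).
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

name = "gpt2"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).eval()

enc = tok("The doctor said that", return_tensors="pt")
embeds = model.get_input_embeddings()(enc["input_ids"]).detach().requires_grad_(True)
logits = model(inputs_embeds=embeds, attention_mask=enc["attention_mask"]).logits[0, -1]

target, foil = tok.encode(" she")[0], tok.encode(" he")[0]
(logits[target] - logits[foil]).backward()   # contrastive objective

relevance = (embeds.grad * embeds).sum(-1).squeeze(0)
for token, score in zip(tok.convert_ids_to_tokens(enc["input_ids"][0]), relevance):
    print(f"{token:>10s} {score.item():+.3f}")
```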

4.4.3 Probing-based

Probing-based methods aim at understanding what knowledge an LLM has captured through "queries" (probes). A classifier (probe) is commonly trained on a model’s activations to distinguish different types of inputs and outputs.

Knowledge-based For example, encoders such as BERT, MiniXX, and T5 that produce vectors can be probed by training a classifier on their outputs to identify the presence of properties or abilities that emerge from the inputs, such as syntactic knowledge (Chen et al. 2021) and semantic knowledge (Tenney et al. 2019). As an alternative to training classifiers, datasets focusing on specific aspects such as grammar can be created (Marvin and Linzen 2018); the model's performance on such a dataset indicates its ability to capture the property. The design of these datasets requires care, as regularities might provide an opportunity for shortcut learning (Zhong et al. 2021), which foregoes learning the properties in favor of exploiting dataset-specific regularities. Hernandez et al. (2023) learn how to map statements in natural language to fact encodings (in an LLM's representation). In turn, this allows a new way to detect (and explain) when LLMs fail to integrate information from context; the work argues that untruthful texts result from not integrating textual information into specific internal representations. Probing has also been used to study training dynamics: Liu et al. (2021) investigated the training of RoBERTa over time and found that local information, such as parts of speech, is acquired before long-distance dependencies such as topics. Goyal et al. (2022) analyze how a large language model learns text summarization by obtaining summaries and output probabilities for a fixed set of articles at different points during training. They track the n-gram overlap between generated summaries and the original articles over time, concluding, for example, that models learn to copy early during training (high overlap) and do so less as training progresses (decreasing overlap).
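
A minimal linear-probe sketch is given below: frozen encoder representations are fed to a simple classifier to test whether a toy property is linearly decodable. Model name, example sentences, and labels are illustrative; a real probing study would use held-out data and carefully constructed datasets.

```python
# Linear-probe sketch: train a simple classifier on frozen encoder representations
# to test whether a property is linearly decodable (model and data illustrative).
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoTokenizer, AutoModel

name = "bert-base-uncased"
tok = AutoTokenizer.from_pretrained(name)
encoder = AutoModel.from_pretrained(name).eval()

texts = ["the cats sleep", "the cat sleeps", "the dogs bark", "the dog barks"]
labels = [1, 0, 1, 0]  # toy property: plural (1) vs. singular (0) subject

def embed(text: str) -> torch.Tensor:
    with torch.no_grad():
        out = encoder(**tok(text, return_tensors="pt"))
    return out.last_hidden_state[0, 0]  # [CLS] representation

X = torch.stack([embed(t) for t in texts]).numpy()
probe = LogisticRegression(max_iter=1000).fit(X, labels)
print("probe accuracy:", probe.score(X, labels))  # in practice: held-out data
```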

Concept-based explanation Typically, given a set of concepts, concept-based explanations provide relevance scores for these concepts within inputs (Kim et al. 2018). More recently, it has also been proposed to uncover concepts based on what input information is still present at specific layers (or embeddings) (Schneider and Vlachos 2023a). While the latter investigates images, high-impact concepts as a source of explanation have also been used for LLMs (Zhao et al. 2023b). Foote et al. (2023) interpret a large set of individual neurons by constructing a visualizable graph using the training data together with truncation and saliency methods. However, concept-based methods must be designed carefully, as merely investigating interactions among input variables might be insufficient to show that symbolic concepts are learned (Li and Zhang 2023).

Neuron activation explanation Individual neurons can also be understood via their activations on inputs. Recently, GPT-4 has been used to generate textual explanations for individual neuron activations of GPT-2 (OpenAI 2023b); for example, GPT-4 summarizes the text that triggers large activations for a neuron. Dai et al. (2022) uncovered "knowledge neurons" that store particular facts. They performed knowledge attribution by gradually increasing a neuron's weights from 0 to their original value while summing up the gradient (an integrated-gradients scheme); if the neuron is relevant for a particular fact, the sum should be large.
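
The following sketch illustrates the underlying idea of neuron-level inspection: a forward hook records the value of one intermediate MLP unit, and the token that activates it most strongly in each input is reported. The layer and unit indices are arbitrary; summarizing such top-activating contexts in natural language is what the GPT-4-based approach automates.

```python
# Sketch: characterize one intermediate MLP unit by the tokens that activate it
# most strongly (model name and unit indices illustrative).
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

name = "gpt2"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).eval()

layer, neuron = 5, 123  # arbitrary choice of hidden unit to inspect
acts = {}
# Record the value of one intermediate MLP unit (output of the first MLP projection).
model.transformer.h[layer].mlp.c_fc.register_forward_hook(
    lambda module, inputs, output: acts.update(values=output[..., neuron].detach())
)

texts = ["Paris is the capital of France.", "The stock market fell sharply today."]
for text in texts:
    enc = tok(text, return_tensors="pt")
    with torch.no_grad():
        model(**enc)
    tokens = tok.convert_ids_to_tokens(enc["input_ids"][0])
    top = acts["values"][0].argmax().item()
    print(f"unit {neuron} in layer {layer} fires most on {tokens[top]!r} in: {text!r}")
```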

4.4.4 Mechanistic interpretability

Mechanistic interpretability investigates neurons and their interconnections, aiming to reverse-engineer model components into human-understandable algorithms (Olah 2022). Models can be viewed as graphs (Geiger et al. 2021), and circuits (i.e., subgraphs) can be identified that yield certain functionality (Wang et al. 2022b). Common approaches fall into three categories (Luo and Specia 2024): circuit discovery, causal tracing, and vocabulary lens. The typical workflow to discover a circuit is often manual and involves (Conmy et al. 2024): (i) observing a behavior (or task) of a model, creating a dataset to reproduce it, and choosing a metric to measure the extent to which the model performs the task; (ii) defining the scope of interpretation (for example, the layers of the model); and (iii) performing experiments to prune connections and components from the model. Circuit-based analysis can also focus on specific architectural elements. For instance, feedforward layers have been assessed and associated with human-understandable concepts (Geva et al. 2022). Additionally, two-layer attention-only networks have been investigated, leading to conjectures about how in-context learning might work (Olsson et al. 2022). Recent work has automated the process of finding connections between abstract neural network units that constitute a circuit (Conmy et al. 2024). Modern causal tracing commonly estimates the impact of intermediate activations on the output (Meng et al. 2022). While causal tracing moves from activations to outputs, the vocabulary lens focuses on establishing relations to the vocabulary space. For example, Geva et al. (2022) project weights and hidden states to the vocabulary space. Individual tokens have also been assessed (Ram et al. 2022; Katz and Belinkov 2023): Katz and Belinkov (2023) create information-flow graphs in which processed vectors within attention heads and memory values are mapped to human-readable tokens.
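
A simple ablation experiment in this spirit can be sketched as follows: each attention head is knocked out in turn (via a head mask) and the change in a task metric, here the log-probability of a correct completion, is recorded; heads whose removal causes large drops are candidate circuit members. Model, prompt, and threshold are illustrative, and this is a far cry from full circuit discovery.

```python
# Sketch of a circuit-style ablation experiment: knock out one attention head at
# a time and measure the drop in the log-probability of a correct completion
# (model, prompt, and threshold illustrative).
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

name = "gpt2"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).eval()

prompt, answer = "The Eiffel Tower is located in", " Paris"
enc = tok(prompt, return_tensors="pt")
answer_id = tok.encode(answer)[0]

def answer_logprob(head_mask=None):
    with torch.no_grad():
        logits = model(**enc, head_mask=head_mask).logits[0, -1]
    return torch.log_softmax(logits, dim=-1)[answer_id].item()

base = answer_logprob()
n_layers, n_heads = model.config.n_layer, model.config.n_head
for layer in range(n_layers):
    for head in range(n_heads):
        mask = torch.ones(n_layers, n_heads)
        mask[layer, head] = 0.0            # ablate a single head
        drop = base - answer_logprob(mask)
        if drop > 0.5:                      # arbitrary reporting threshold
            print(f"layer {layer} head {head}: log-prob drop {drop:.2f}")
```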

4.4.5 Structuring based on novel dimensions

We also classify existing techniques based on the uncovered characteristics, as shown in Table 6. Regarding scope, no existing technique explicitly focuses on the entire interaction; however, LLMs providing self-explanations could be used for that purpose. When it comes to explaining one, multiple, or all properties of the output, a feature attribution map typically explains at most one attribute of the output. For example, by highlighting positive and negative words or phrases, sentiment might be explained. Sample-based techniques can potentially explain multiple attributes: if the chosen samples differ in certain characteristics, such as images showing the same object but in different poses and colors, it can be concluded that the latter two attributes are irrelevant. Mechanistic interpretability and probing often aim to isolate a single concept, but this is not a requirement. Existing XAI techniques are typically non-dynamic, whereas explanations from LLMs can be interactive, personalized, and often sample-dependent. Table 7 shows a concept matrix linking foundational sources to XAI methods. Feature attribution techniques are typically posthoc methods that leverage the model to generate explanations. However, LLMs could also be prompted to generate feature attribution explanations. Sample-based techniques commonly rely on the training data and the model, for example, to determine which samples most strongly activate a particular neuron. While mechanistic approaches commonly rely on a dataset, it does not have to stem from the training data. Probing can be performed naturally through prompts and other forms of input, followed by analyzing the corresponding outputs.

Table 7 Mapping of XAI categories and selected techniques to foundational sources for XAI

Most existing techniques require white-box access (Table 8). It might be possible to gain limited insights using black-box access only, for example, by using occlusion (or masking) for feature attribution techniques: one can investigate whether predictions change, but such an approach typically yields coarser and less accurate explanations than having access to the output probabilities.

Table 8 Mapping of XAI categories and selected techniques to required access by XAI techniques

5 Research Agenda of XAI for GenAI, Discussion and Conclusions

As the field of XAI for GenAI is quickly emerging, there are many opportunities for research covering more technical algorithmic avenues as well as economic and psychological aspects. Bridging the gap between AI research and other disciplines, such as cognitive science, psychology, and the humanities, is likely the way forward for many topics. For instance, evaluating explanations often requires expertise beyond AI, such as domain expertise demonstrated in studies involving product engineers (Johny et al. 2024) or psychologists who provide theoretical frameworks and insights on human behavior, e.g., when assessing the suitability of LLM outputs for children (Schneider et al. 2024b). Furthermore, policymakers and AI experts need to collaborate to develop regulations that are both implementable and effective in limiting risks due to legal uncertainties for companies (Walke et al. 2023; Schneider et al. 2024a). More generally, possible directions include:

  1. Explaining interactions rather than single input–outputs is particularly interesting for interdisciplinary research conducted in fields such as information systems and human-computer interaction, as it requires an understanding of humans, models, and systems (see "Interaction Scope" in Sect. 4.2.1).

  2. Real-time and interactive explanations based on user queries and feedback could be further explored, necessitating insights from multiple disciplines (Sect. 4.2.3).

  3. Multimodal GenXAI: Currently, most explanations are of a single modality, commonly either text or visual. There is a lack of techniques providing explanations using more than one modality (Sect. 4.2.2).

  4. Adding XAI to novel directions of GenAI, such as GenXAI for video, 3D content generation, and actions (see Table 2), is urgently needed.

  5. Deriving novel XAI techniques, particularly in the field of mechanistic interpretability (Sect. 4.4.4), which investigates the inner workings of GenAI models, is a promising yet challenging frontier.

  6. Addressing verifiability and hallucinations using AI: Hallucinations are one of the biggest challenges of GenAI. GenXAI can help mitigate them using techniques such as chain-of-thought prompting, which explains the reasoning process, but more techniques are needed.

  7. Porting pre-GenAI XAI techniques to large models by addressing their computational concerns, as was done for SHAP (see Sect. 4.4.1).

  8. Personalization of explanations: Recognizing the diversity in users’ backgrounds, expertise, and needs, future research should focus on personalizing explanations as an important desideratum (see Sect. 3.3).

  9. Explanation difficulty has received little attention. A more thorough understanding in the context of GenAI, quantifying the phenomenon and potentially leading to techniques that account for such difficulty, is another future avenue (see Sect. 4.3.4).

  10. Dealing with the complex nature of GenAI outputs (compared to simple classification in the pre-GenAI context) remains under-explored. Interactive, user-driven investigation is one avenue (Sect. 4.2.3); isolating particular facets using mechanistic interpretability techniques focusing on circuits (Sect. 4.4.4) is another. In general, little is known as of today. For example, it is unclear which facets should be explained for text or images (see "Output Scope" in Sect. 4.2.1) and how to relate actual explanations to high-level objectives such as maximizing plausibility, e.g., "What attributes should be explained so that an explanation appears plausible?" (Sect. 4.2.3).

  11. Building GenXAI to target ethical, societal, and regulatory concerns, including contributing to the mitigation of biases and enhancing fairness by identifying and quantifying them through XAI techniques (see the second-to-last paragraph in this section).

  12. GenXAI itself raises a number of critical ethical issues. For example, in legal cases, an explanation of why someone performed an action can decide over life and death. As such, explanations themselves need to be carefully assessed with respect to alignment criteria (Sect. 3.1).

As with any work, this paper adopts a specific perspective, emphasizing certain aspects while foregoing others. Specifically, we did not aim to be comprehensive in all regards. Given the vast array of works on GenAI and XAI, we decided to omit detailed aspects of XAI related to the evaluation and usage of explanations. Additionally, we did not reiterate all existing XAI techniques prior to GenAI; instead, we focused on novel aspects and methods for GenAI, referring to other surveys for this purpose. Regarding modalities, we concentrated on images and text, as these are currently the two most prevalent. However, other modalities like video and 3D content generation are quickly emerging, and audio-to-text (and vice versa) has been established for some time. We conceptualized existing techniques and discussed technical aspects, but a more mathematical treatment could be provided. As AI evolves rapidly, our work is only a snapshot in time. Our conceptualization is likely to evolve further, though we believe the uncovered dimensions will endure, albeit with enhancements and modifications. Additionally, GenAI (in combination with XAI) involves many ethical and societal implications, which we partially addressed in Sects. 3.2 and 3.1. However, we refrained from a detailed exploration. We chose to adopt a neutral stance, pointing out concerns rather than prescribing actions to address ethical issues. For example, the lack of access to commercial models hinders transparency but protects company know-how. As such, one might advocate for regulations to increase transparency or allow companies to keep model details secret to protect intellectual property. From an organizational perspective, greater transparency might also enhance user trust. Therefore, the preferred governance of GenAI transparency by companies or society is debatable (Schneider et al. 2024a). We do not take a particular stance in these debates.

In conclusion, GenXAI is a crucial area for AI. Given the rapid advancement of GenAI and its widespread implications for individuals, society, and the economy, more work should be dedicated to this area, especially considering the many research gaps highlighted in our roadmap, such as interactivity and the scope of explanations. Our work contributes in this direction by thoroughly motivating the study of XAI for GenAI, structuring existing knowledge through an enhanced conceptualization of XAI, uncovering novel dimensions, and setting forth a research agenda that calls for a joint effort by the research community to address open issues, which will hopefully contribute to the well-being of all.