
1 Introduction

Generative adversarial networks are a type of neural network architecture in which one network part generates solutions to a task and another part compares and rates the generated solutions against a priori known solutions. While at first glance this does not sound much different from any loss function, which essentially also compares a generated solution with the gold standard, there is one fundamental difference: a loss function is static, but the “judge” or “discriminator” network part is trainable (Fig. 1). This means that it can be trained to distinguish the generated from the true solutions and, as long as it succeeds in this task, a training signal for the generative part can be derived. This is how the notion of adversaries came into the name GAN. The discriminator part is trained to distinguish true from generated solutions, while the generative part is trained to arrive at the most realistic-appearing solutions, making them adversaries with regard to their aims.

Fig. 1

The fundamental GAN setup for image generation, consisting of a generator and a discriminator network, here implemented as CNNs

Generative adversarial networks are now among the most powerful tools to create natural-looking images from many domains. While they were created in the context of image generation, the original publication describes the general idea of how to make two networks learn by competing, regardless of the application domain. This key idea can be applied to generative tasks beyond image creation, including text generation, music generation, and many more.

Research interest skyrocketed in the years after the first publication proposing an adversarial training paradigm [1]. Looking at the number of web searches for the topic “generative adversarial networks” shows how interest in the topic rapidly grew, but also how it has started to decline in recent years. Since 2014, authors have cast all kinds of problems into the GAN framework to enable this powerful training mechanism for a variety of tasks, including image analysis tasks. This is surprising at first, since there is no immediate similarity between a generative task and, for example, a segmentation or detection task. Still, as evidenced by the success in these application areas, the adversarial training approach can be applied with benefit. Clearly, the decline in interest can to some degree be attributed to the emergence of best practices and proven implementations, while simultaneously the scientific interest has recently shifted to successor approaches. However, similar to the persistent relevance of CNN architectures like ResNets for classification, Mask R-CNNs for detection, or basic transformer architectures for sequence processing, GANs will remain an important tool for image creation and image analysis. The adversarial training paradigm has become an ingredient of models beyond purely generative aims, providing flexible ways to custom-tailor loss components for given tasks (compare Figs. 2 and 3).

Fig. 2

Google web search-based interest estimate for “generative adversarial networks” since 2014. Relative scale

2 Generative Models

Generative processes are fundamentally hard to grasp computationally. Their nature and purpose are to create something “meaningful” out of something less meaningful (even random). The first question to ask therefore is how this can even be possible for a computer program since, intuitively, creation requires an inventive spirit, call it creativity, to use the term humans tend to associate with it. To introduce some of the terminology and basic concepts that we will use in the remainder of this section, some remarks on human creativity will set the scene.

Fig. 3
Repositories shown in the figure, with their rank within this selection: CycleGAN (1), pix2pix (2), Progressive GAN (3), DCGAN (4), GAN code (6), Improved GAN (7), and WGAN-GP (8), plotted against Google search interest from 2014 to 2019.

Some of the most-starred shared GAN code repositories on GitHub, up to 2018. Ranking within this selection in brackets

In fact, creative human acts are inherently limited by our concepts of the world, acquired by learning and experience through the sensory means we have available, and by the available expressive means (tools, instruments, …) with which we can even conceive of creating something. This is true for any kind of creative act, including writing, painting, wood carving, or any other art, and similarly also for computer programming, algorithm development, or science in general. Our limited internal representation of the world around us frames our creative scope.

This is very comparable to the way computerized, programmed, or learned generative processes create output. They have either an in-built mechanism, or a way to acquire such a mechanism, that represents the tools by which creation is possible, as well as a model of the world that defines the scope of outputs. Practically, a CNN-based generative process uses convolutions as the in-built tool and is by this tool geared to produce image-like outputs. The convolutional layers, if not a priori defined, will represent a set of operations defined by a training process and limited in their expressiveness by the training material—by the fraction of the world that was presented. This will lead us to the fundamental notion of how to capture the variability of the “fraction of the world” that is interesting and how to make a neural network represent this partial world knowledge. It is interesting to note at this point that neither for human creative artists nor for neural networks does the ability to (re)create convincing results imply an understanding of the way the templates (in the real world) have come into existence. Generating convincing artifacts does not imply understanding nature. Therefore, GANs cannot explain the parts of nature they are able to generate.

2.1 The Language of Generative Models: Distributions, Density Estimation, and Estimators

Understanding the principles of generative models requires a basic knowledge of distributions. The reason is that—as already hinted at in the previous section—the “fraction of the world” is in fact something that can be thought of as a distribution in a parameter space. If you were to describe a part of the world in a computer-interpretable way, you would define descriptive parameters. To describe persons, you could characterize them by simple measures like age, height, weight, hair and eye color, and many more. You could add blood pressure, heart rate, muscle mass, maximum strength, and more, and even a whole-genome sequencing result might be a parameter. Each of the parameters individually can be collected for the world population, and you will obtain a picture of how this parameter is “distributed” worldwide. In addition, parameters will be related to each other, for example, age and maximum strength. Countless such relationships exist, of which the majority are and probably will remain unknown. These interrelationships are captured by the joint distribution. If you knew the joint distribution, you could “create” a plausible parameter combination of a nonexisting human. Let us formalize these thoughts now.

2.1.1 Distributions

A distribution describes the frequency of particular observations when watching a random process. Plotting the number of occurrences over an axis of all possible observations creates a histogram. If the possible observations can be arranged on a continuous scale, one can see that observations cluster in certain areas, and we say that they create a “density” or are “dense” there. Hence, describing where the densities lie in parameter space is associated with the desire to reproduce, or sample from, distributions, as we want to do to generate instances from a domain. Before the function that generates the observations can be reproduced, the dense areas must be estimated. This will, in the most general sense, be called density estimation.

Sometimes, the shape of the distribution follows an analytical formula, for example, the normal distribution. If such a closed-form description of the distribution can be given, it generalizes the shape of the histogram of observations and makes it possible to produce new observations very easily, by simply sampling from the distribution. When we say our observations follow a normal distribution, we mean that we expect to observe instances more frequently around the mean of the normal distribution than toward the tails. In addition, the standard deviation quantifies how much more likely observations close to the mean are compared to observations in the tails. We thus describe our observations with a parametric description of the observed density.
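
As a minimal sketch of this idea, the following snippet fits a normal distribution to synthetic observations by estimating its two parameters and then “generates” new, plausible observations by sampling from the fit (all data and values are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical observations, e.g., body heights in cm (synthetic data for illustration)
observations = rng.normal(loc=172.0, scale=9.0, size=10_000)

# Estimate the two parameters of a normal distribution from the observations
mu_hat = observations.mean()
sigma_hat = observations.std()

# "Generate" new, plausible observations by sampling from the fitted distribution
new_samples = rng.normal(loc=mu_hat, scale=sigma_hat, size=5)
print(mu_hat, sigma_hat, new_samples)
```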

In the remainder of this section, rather than providing a rigorous mathematical definition and description of the mathematics of distributions and (probability) density estimation, we will introduce the basic concepts and terminology in an intuitive way (also compare Box 1). Readers with the wish for a more in-depth treatment can find tutoring material in the references [2,3,4,5,6].

Box 1: Probability Distributions: Terminology

Several common terms regarding distributions have intuitive interpretations which are given in the following. Let a be an event from the probability distribution A, written as a ∼ A, and b ∼ B an event from another probability distribution.

In a medical example, A might be the distribution of possible neurological diseases and B the distribution of all possible variations of smoking behavior.

Conditional Probability P(A|B):

The conditional probability of a certain a ∼ A, for example, a stroke, might depend on the concrete smoking history of a person, described by b ∼ B. The conditional probability is written as p(a|b) for the concrete instances or P(A|B) if talking about the entire probability distributions A and B.

Joint Probability P(A,  B):

The probability of seeing instantiations of A and B together is termed the joint probability. Notably, if expanded, this will lead to a large table of probabilities, joining each possible a ∼ A (e.g., stroke, dementia, Parkinson’s disease, etc.) with each possible b ∼ B (casual smoker, frequent smoker, nonsmoker, etc.).

Marginal Probability:

The marginal probabilities of A and B (denoted, respectively, P(A) and P(B)) are the probabilities of each possible outcome across (and independent of) all of the possible outcomes of the other distribution. For example, it is the probability of seeing nonsmokers across all neurological diseases or seeing a specific disease regardless of smoking status. It is said to be the probability of one distribution marginalized over the other probability distributions.
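
The three terms can be made concrete with a tiny numerical sketch; the joint probability table below is entirely made up for illustration and only mimics the medical example:

```python
import numpy as np

# Hypothetical joint probability table P(A, B) for the example above:
# rows: neurological condition A in {stroke, dementia, none}
# cols: smoking behavior B in {nonsmoker, casual smoker, frequent smoker}
# (all numbers are made up for illustration; the table sums to 1)
joint = np.array([
    [0.05, 0.04, 0.06],   # stroke
    [0.08, 0.05, 0.02],   # dementia
    [0.40, 0.20, 0.10],   # no disease
])

p_A = joint.sum(axis=1)                       # marginal P(A): summed over smoking status
p_B = joint.sum(axis=0)                       # marginal P(B): summed over conditions
p_A_given_frequent = joint[:, 2] / p_B[2]     # conditional P(A | B = frequent smoker)

print(p_A, p_B, p_A_given_frequent)
```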

2.1.2 Density Estimation

We assume in the following that our observations have been produced by a function or process that is not known to us and that cannot be guessed from an arrangement of the observations. In a practical example, the images from a CT or MRI scanner are produced by such a function. Notably, the concern is less about the intractability of the imaging physics but about the appearance of the human body. The imaging physics might be modeled analytically up to a certain error. But the outer shape and inner structure of the human body and its organs depend on a large amount of mutually influencing factors. Some of these factors are known and can even be modeled, but many are not. In particular, the interdependence of factors must be assumed to be intractable. What we can accumulate is measured data providing information about the body, its shape, and its function. While many measurement instruments exist in medicine, for this chapter, we will be concerned with images as our observations. In the following thought experiment, we will explore a naïve way to model the distribution and try to generate images.

The first step is to examine the gray value distribution or, in other words, estimate the density of values. The most basic way for estimating a density is plotting a histogram. Let the value on the x axis be the image gray value of the medical image in question (in CT expressed in Hounsfield units (HU) and in arbitrary units for MRI). Two plots show histograms of a head MRI (Fig. 4) and an abdominal CT (Fig. 5). While the brain MRI suggests three or four major “bumps” of the histogram at about values 25, 450, and 600, the abdominal CT doesn’t lend itself to such a description.
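
Such a histogram can be computed in a few lines. The sketch below assumes a hypothetical NIfTI volume brain_t1.nii.gz and uses the nibabel and matplotlib libraries:

```python
import numpy as np
import nibabel as nib              # commonly used library for reading NIfTI images
import matplotlib.pyplot as plt

# Hypothetical file name; any 3D brain MRI volume in NIfTI format would do
volume = nib.load("brain_t1.nii.gz").get_fdata()
slice_2d = volume[:, :, volume.shape[2] // 2]     # central axial slice

counts, bin_edges = np.histogram(slice_2d.ravel(), bins=100)

plt.bar(bin_edges[:-1], counts, width=np.diff(bin_edges))
plt.xlabel("voxel value")
plt.ylabel("count")
plt.show()
```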

Fig. 4

Brain MRI (left) and histogram of gray values for one slice of a brain MRI

Fig. 5

Abdominal CT (left) and histogram of gray values for one slice of an abdominal CT

In the next step, we want to describe the histograms through analytical functions, to make them amenable for computational ends. This means we will aim to estimate an analytical description of the observations.

Expectation maximization (EM; see Box 2) is an algorithm suitable for this task. EM enables us to perform maximum likelihood estimation in the presence of unobserved (“latent”) variables and incomplete data, which is the default assumption when dealing with real data. Maximum likelihood estimation (MLE) is the process of finding the parameters of a parametric distribution that most accurately match the distribution to the observations. In MLE, this is achieved by adapting the parameters steered by an error metric that indicates the closeness of the fit; in short, it is a parameter optimization algorithm.

Box 2: Expectation Maximization—Example

Focusing on our density estimate of the MRI data, we want to use expectation maximization (EM) to optimize the parameters of a fixed number of Gaussian functions adding up to the closest possible fit to the empirical shape of the histogram.

In our data, we observe “bumps” of the histogram. By image analysis, we can determine that certain organs imaged by MRI lead to certain bumps in the histogram, since they are of different material and create different signal intensities. This assignment, however, is unknown to EM; it is hidden in the so-called “latent” variables.

The EM algorithm has two parts, the expectation step and the maximization step. They can, with quite far-reaching omission of details, be sketched as follows:

Expectation:

takes each point (or a number of sampled points) of the distribution and estimates the expectation of which of the parameterized distributions it should be assigned to. Figuring out this assignment is the step of dealing with the “latent” variable of the observations.

Maximization:

iterates over all parameterized distributions and adjusts their parameters to match the assigned points as well as possible.

This process is iterated until a fitting error cannot be improved anymore.

A short introductory treatment of EM with examples and applications is presented in [7]. The standard reference for the algorithm is [8].

In Fig. 6, a mixture of four Gaussian distributions has been fit to the brain MRI voxel value data seen before.
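
In practice, such a fit can be obtained with a few lines of code. The sketch below runs EM via scikit-learn’s GaussianMixture on synthetic voxel values that merely mimic the situation of Fig. 4; all numbers are assumptions:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Synthetic stand-in for the voxel values of one MRI slice: a few "tissue classes"
# with roughly Gaussian intensities (all values are assumptions for illustration).
voxels = np.concatenate([
    rng.normal(25, 10, 40_000),
    rng.normal(450, 40, 15_000),
    rng.normal(600, 50, 10_000),
    rng.normal(800, 60, 5_000),
]).reshape(-1, 1)                  # sklearn expects shape (n_samples, n_features)

# Fit a mixture of four 1D Gaussians; GaussianMixture runs EM internally
gmm = GaussianMixture(n_components=4, random_state=0).fit(voxels)

print("means   :", gmm.means_.ravel())
print("stddevs :", np.sqrt(gmm.covariances_).ravel())
print("weights :", gmm.weights_)
```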

Fig. 6

A Gaussian mixture model (GMM) of four Gaussians was fit to the brain MRI data we have visualized as a histogram in Fig. 4

It is tempting to model even more complex observations by mixing simple analytical distributions (e.g., Gaussian mixture models (GMMs)), but in general this will be intractable for two reasons. Firstly, realistic joint distributions will have an abundance of mixed maxima and therefore require a vast number of basic distributions to fit. Even basic normal distributions in high-dimensional parameter spaces are no longer functions of two parameters (μ, σ), but of a vector of means and a covariance matrix. Secondly, it is no longer trivial to sample from such high-dimensional joint distributions, and while some methods, among others Markov chain Monte Carlo methods, allow sampling from them, such numerical approaches are of such high computational complexity that their use in the context of deep neural network parameter estimation is difficult.

We will learn about alternatives. In principle, there are different approaches to density (distribution) estimation: direct distribution estimation, distribution approximation, or, even more indirectly, the use of a simple surrogate distribution that is made to resemble the unknown distribution as well as possible through a mapping function. We will see this in the further elaboration of generative modeling approaches.

2.1.3 Estimators and the Expected Value

Assume we have found suitable mean values and standard deviations for four normal distributions that together approximate the shape of the MRI data density estimate to our satisfaction. Such a combination of normal (Gaussian) distributions is called a Gaussian mixture model (GMM), and sampling from such a GMM is straightforward. We are thus able to sample any number of single voxel values, and over time we will sample them such that their density estimate or histogram will look similar to the one we started with.

However, if we want to generate a brain MRI image using a sampling process from our closed-form GMM representation of the distribution, we will notice that a very important notion was not respected in our approach. We start with one slice of 512 × 512 voxels and therefore randomly draw the required number of voxel values from the distribution. However, this will not yield an image that resembles one slice of a brain MRI but will look almost like random noise, because we did not model the spatial relation of the gray values with respect to each other. Since the majority of voxels of a brain MRI are not independent of each other, drawing one new voxel from the distribution needs to depend on the spatial locations and gray values of all voxels drawn before. Neighboring voxels will have a higher likelihood of similar gray values than voxels far apart from each other, for example. More crucially, underneath the interdependence lies the image generation process: the image values observed in a real brain MRI stem from actual tissue—and this is what defines their interdependence. This means the anatomy of the brain indirectly reflects itself in the rules describing the dependency of gray values on one another.
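
The following sketch makes this failure concrete: it draws every voxel of a 512 × 512 “slice” independently from a fitted mixture model (parameters assumed for illustration), which reproduces the histogram but none of the anatomy:

```python
import numpy as np

rng = np.random.default_rng(0)

# Parameters of a fitted 1D Gaussian mixture (values assumed for illustration)
means   = np.array([25.0, 450.0, 600.0, 800.0])
stds    = np.array([10.0,  40.0,  50.0,  60.0])
weights = np.array([0.55,  0.22,  0.15,  0.08])

# Draw each of the 512 x 512 voxels independently: first pick a mixture
# component per voxel, then sample from the corresponding Gaussian.
shape = (512, 512)
components = rng.choice(len(weights), size=shape, p=weights)
fake_slice = rng.normal(means[components], stds[components])

# The gray value histogram of `fake_slice` matches the mixture model,
# but the "image" is pure noise: no spatial relationship was modeled.
```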

For the modeling process, this implies that we cannot argue about single-voxel values and their likelihood, but we need to approach the generative process differently. One idea for a generative process has been implied in the above description already: pick a random location of the to-be-generated image and predict the gray value depending on all existing voxel values. Implemented with the method of mixture models, this results in unfathomably many distributions to be estimated, as for each possible “next voxel” location, any possible combination of already existing voxel values and positions needs to be considered. We will see in Subheading 5.1 on diffusion models how this general approach to image generation can still be made to work.

A different sequential approach to image generation has also been attempted, in which pixels are generated in a defined order, starting at the top left and scanning the image row by row across the columns. Again, the knowledge about the already produced pixels is memorized and used to predict the next pixel. This has been dubbed the PixelRNN (Pixel Recurrent Neural Network), which borrows its general idea from text-processing networks [9].

Lastly, a direct approach to image generation could be formulated by representing or approximating the full joint distribution of all voxels in one distribution that is tractable and sampling all voxels at once from it. The full joint distribution in this approach remains implicit, and we use a surrogate. This will actually be the approach implemented in GANs, though not in a naïve way.

Running the numbers on what a naïve likelihood-based approach implies makes the difficulties of making it work obvious. Consider an MRI image as the joint distribution of 512 × 512 voxels (one slice of our brain MRI), where we approximated the gray value distribution of one voxel with a GMM with six parameters. This results in a joint distribution with 512 × 512 × 6 = 1,572,864 parameters. Conceptually, this representation therefore spans a 1,572,864-dimensional space, in which every brain MRI slice will be one data point. Referring back to the histograms of CT and MRI images in the figures above, we have seen continuous lines with densities because we have collected all voxels of an entire medical image, which are many millions. Still, we only covered one single dimension out of the roughly 1.5 million. Searching for the density in the 1,572,864-dimensional MRI-slice space that is given by all collected brain MRI slices is the difficult task any generative algorithm has to solve. In this vastly large space, the brain MRI slices “live” in a very tiny region that is extremely hard to find. We say the images occupy a low-dimensional manifold within the high-dimensional space.

Consider the maximum likelihood formulation

$$ \hat{\theta}=\arg \underset{\theta }{\max }{\mathbbm{E}}_{x\sim {P}_{\mathrm{data}}}\log {Q}_{\theta}\left(x|\theta \right) $$
(1)

where Pdata is the unknown data distribution and Qθ the distribution generated by the model, which is parameterized by θ. θ can, for example, be the weights and biases of a deep neural network. In other words, the result of maximum likelihood estimation is the set of parameters \( \hat{\theta} \) that maximizes the expectation, taken over examples x drawn from the real data distribution, of the log of the conditional probability (likelihood) of seeing the example x under the model Qθ. Hence, maximizing the likelihood function means maximizing the probability that examples from Pdata are also seen under Qθ, which will be the case when Q matches P as closely as possible given the parametric form of Q.
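
A minimal numerical sketch of Eq. 1, assuming synthetic 1D data and a single Gaussian as Qθ: the parameters are found by minimizing the negative of the average log-likelihood over samples from Pdata; all names and values are illustrative:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
data = rng.normal(loc=450.0, scale=40.0, size=5_000)   # samples x ~ P_data (synthetic)

# Negative of the empirical expectation E_{x ~ P_data}[ log Q_theta(x) ],
# with Q_theta a single Gaussian parameterized by theta = (mu, log_sigma).
def neg_avg_log_likelihood(theta):
    mu, log_sigma = theta
    return -np.mean(norm.logpdf(data, loc=mu, scale=np.exp(log_sigma)))

result = minimize(neg_avg_log_likelihood, x0=np.array([100.0, np.log(10.0)]))
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
print(mu_hat, sigma_hat)   # close to the true 450 and 40
```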

The maximum likelihood mechanism is very nicely illustrated in [10]. Here, it is also visually shown how finding the maximum likelihood estimate of parameters of the distribution can be done by working with partial derivatives of the likelihood function with respect to μ and σ² and seeking their extrema. The partial derivatives are called the score function and will make a reappearance when we discuss score-based and diffusion models later in Subheading 5.1 on advanced generative models.

2.1.4 Sampling from Distributions

When a distribution is a model of how observed values occur, then sampling from this distribution is the process of generating random new values that could have been observed, with a probability similar to the probability of observing this value in reality. There are two basic approaches to sampling from distributions. The first is to generate a random number from the uniform distribution (this is what a random number generator is always doing underneath) and feed this number through the inverse cumulative distribution function (iCDF), i.e., the inverse of the CDF, which in turn is the integral of the probability density function (PDF). This only works if the iCDF is available in closed form. If it is not, the second approach can be used, which is called acceptance (or rejection) sampling. With f being the PDF, two random numbers x and y are drawn from uniform distributions, x over the support of f and y over the range of f. The random x is accepted if f(x) > y and rejected otherwise.
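
Both approaches can be sketched in a few lines, here with illustrative distributions: inverse-CDF sampling for an exponential distribution (whose inverse CDF is available in closed form) and rejection sampling for an unnormalized bimodal density:

```python
import numpy as np

rng = np.random.default_rng(0)

# 1) Inverse-CDF sampling: for an exponential distribution with rate lam,
#    the CDF is F(x) = 1 - exp(-lam * x), so the inverse CDF is F^{-1}(u) = -log(1 - u) / lam.
lam = 2.0
u = rng.uniform(size=10_000)
exp_samples = -np.log(1.0 - u) / lam

# 2) Rejection sampling for a density without a closed-form inverse CDF,
#    here an (unnormalized) bimodal mixture of two Gaussians on [0, 10].
def f(x):
    return 0.6 * np.exp(-0.5 * ((x - 3.0) / 0.7) ** 2) + 0.4 * np.exp(-0.5 * ((x - 7.0) / 1.0) ** 2)

f_max = f(np.linspace(0, 10, 1000)).max()      # upper bound on f
x = rng.uniform(0, 10, size=50_000)            # candidate values
y = rng.uniform(0, f_max, size=50_000)         # uniform heights
accepted = x[f(x) > y]                         # keep x where the height falls below f(x)
```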

Our use case, as we have seen, involves not only high-dimensional (multivariate) distributions but, even more, their joint distributions, and they are not given in closed form. In such scenarios, sampling can still be done using Markov chain Monte Carlo (MCMC) sampling, a framework built around an accept/reject step with added mechanisms to increase efficiency. While MCMC has favorable theoretical properties, it is still computationally very demanding for complex joint distributions, which leads to important difficulties in the context of sampling from the distributions we are facing in the domain of image analysis and generation.

We are therefore at this point facing two problems: we can hardly hope to be able to estimate the density, and even if we could, we could not practically sample from it.

3 Generative Adversarial Networks

3.1 Generative vs. Discriminative Models

To emphasize the difficulty that generative models are facing, compare them to discriminative models. Discriminative models solve tasks like classification, detection, and segmentation, to name some of the most prominent examples. That classification models belong to the class of discriminative models is obvious: discriminating examples is exactly classifying them. Detection models are also discriminative models, though in a broader sense, in that they classify detection proposals into accepted object detections or rejected proposals; even the bounding box estimation, which is often solved through bounding box regression, typically involves the discriminative prediction of template boxes. Segmentation, for example using a U-Net, is only the extension of classic discriminative approaches into a fast framework that avoids pixel-wise inference through the model. Common to all these models is that they yield output corresponding to their input, in the sense that they extract information from the input image (e.g., an organ segmentation, a classification, or even a textual description of the image content) or infer additional knowledge about it (e.g., a volume measurement or an assessment or prediction of a treatment success given the appearance of the image).

Generative models are fundamentally different, in that they generate output potentially without any concrete input, out of randomness. Still, they are supposed to generate output that conforms to certain criteria. In the most general form and intuitive formulation, their output should “look natural.” We want to further formalize the difference between the models in the following by using the perspective of distributions again. Figure 7 shows how discriminative and generative models have to construct differently complex boundaries in the representation space of the domain to accomplish their tasks.

Fig. 7

The discriminative task compared to the generative task. Discriminative models only need to find the separating line between classes, while generative models need to delineate the part of space covering the classes (figure inspired by: https://developers.google.com/machine-learning/gan/generative)

Discriminative models take one example and map it to a label—e.g., the class. This is also true for segmentation models: they do this for each image voxel. The conceptual process is that the model has to estimate the probabilities that the example (or the voxel) comes from the distribution of the different available classes. The distributions of all possible appearances of objects of all classes do not need to be modeled analytically for this to be successful. It is only important to know them locally—for example, it is sufficient to delineate their borders or overlaps with other distributions of other classes, but not all boundaries are important.

Generative models, on the other hand, are tasked to produce an example that lies within a desired distribution. For this to work, the network has to learn the complete shape of this distribution. This is immensely complex, since all domains of practical importance in medical imaging are extremely high-dimensional, and the distributions defining examples of interest within these domains are very small and hard to find. Also, they are neither analytically given nor normally distributed in their multidimensional space, and they have as many dimensions as the output image of interest has voxels.

As already remarked, various other approaches were devised to generate output before GANs entered the scene. Among the trainable ones were (restricted) Boltzmann machines, deep belief networks, generative stochastic networks, variational autoencoders, and others. Some of them involved feedback loops in the inference process (the prediction of a generated example) and were therefore unstable to train using backpropagation.

This was solved with the adversarial net framework proposed in 2014 by Goodfellow et al. [1]. It addressed the downsides of such previous generative models, like computational intractability and instability, by introducing the adversarial training framework.

To understand how GANs relate to one of the closest predecessors, the variational autoencoder, we will review their basic layout next. We will learn how elegantly the GAN paradigm turns the previously unsupervised approach to generative modeling into a supervised one, with the benefit of much more control over the training process.

3.2 Before GANs: Variational Autoencoders

Generative adversarial networks (GANs) were not the first or only attempt at generating realistic-looking images (or any type of output, generally speaking). Apart from GANs, a related neural network-based approach to generative modeling is the variational autoencoder, which will be treated in more detail below. Other generative models taking different approaches include the following:

Flow-based models :

This category of generative models attempts to model the data-generating distribution explicitly through an iterative process known as a normalizing flow [11], in which, through repeated changes of variables, a sequence of simple, differentiable base distributions is stacked to model the target distribution. The process is fully invertible, yielding models with desirable properties: an analytical handle on the data-generating distribution allows direct density estimation to predict the likelihood of future events, to impute missing data points, and, of course, to generate new samples. Flow-based models are computation-intensive. They can be categorized as methods that return an explicit, tractable density. Other methods in this category are, for example, the PixelRNN [9] and the PixelCNN [12], which also serve for conditional image generation. RealNVP [13] likewise uses a chain of invertible functions.

Boltzmann machines :

work fundamentally differently. They also return explicit densities but this time only approximate the true target distribution. In this regard, they are similar to variational autoencoders, though their method is based on Markov chains and not on a variational approach. Deep Boltzmann machines were proposed as early as 2009, uniting a Markov chain-based loss component with a maximum likelihood-based component and showing good results on, at that time, highly complex datasets [14]. Boltzmann machines are very attractive but harder to train and use than other comparably powerful alternatives that exist today. This might change with future research, however.

Variational autoencoders (VAE) are a follow-up development of plain autoencoders, models that in their essence try to reconstruct their input after transforming it, usually into a low-dimensional representation (see Fig. 8). This low-dimensional representation is often termed the “latent space,” implying that it encodes hidden traits of the data-generating process which are essential to the reconstruction. This is very akin to the latent variables estimated by EM. In the autoencoder, the encoder will learn to code its input in terms of these latent variables, while the decoder will learn to represent them again in the source domain. In the following, we will be discussing the application to images, though, in principle, both autoencoders and their variational variant are general mechanisms that work for any domain.

Fig. 8

Schematic of an autoencoder network. The encoder, for images, for example, a CNN with a number of convolutional and pooling layers, condenses the defining information of the input image into the variables of the latent space. The decoder, again convolutions, but this time with upsampling layers, recreates a representation in image space. Input and output images are compared in the loss function, which drives the gradient descent

We will later be interested in a behind-the-scenes understanding of their modeling approach, which is related to the employed loss function. We will then look at VAEs more extensively from the same vantage point: to understand their loss function, which is closest to the loss formulation of early GANs, the Kullback-Leibler divergence or KL divergence, DKL.

With this tool in hand, we will examine how to optimize (train) a network with regard to KL divergence as the loss and understand key problems with this particular loss function. This will lead us to the motivation for a more powerful alternative.

3.2.1 From AE to VAE

VAEs are an interesting subject to study to emphasize the limits a loss function like KL divergence may place on a model. We will begin with a brief recap of plain autoencoders to introduce the concept of learning a latent representation. We will then proceed to modify the autoencoder into a variational formulation, which brings about the switch to a divergence measure as a loss function. From these grounds, we will then show how GANs again modified the loss function to succeed in high-quality image generation.

Figure 8 shows the schematic of a plain autoencoder (AE). As indicated in the sketch, input and output are of potentially very high dimensionality, like images. Between the encoder and decoder networks lies a “bottleneck” representation of orders of magnitude lower dimensionality (realized, for example, by a convolutional layer with only a few channels or a dense layer with a low number of units), which forces the network to find an encoding that preserves all information required for reconstruction.

A typical loss function for training the autoencoder is, for example, cross entropy, which is applicable for sigmoid activation functions, or simply the mean squared error (MSE). Either loss essentially forces the AE to learn the identity function between input and output.

Let us introduce the notation for this. Let X be the input image tensor and X′ the output image tensor. With fw being the encoder function given as a neural network parameterized by weights and biases w and gv the decoder function parameterized by v, the loss hence works to make X = X′ = gv(fw(X)).

In a variational autoencoder, things work differently. Plain autoencoders, as described before, map the input to a fixed (deterministic) latent code, while variational autoencoders replace this code with a distribution. We can call this distribution pw, indicating the parameterization by w. It is crucial to understand that a choice was made here that imposes conditions on the latent code. It is meant to represent the input data in a variational way: in a way following Bayes’ laws. Our mapping of the input image tensor X to the latent variable z is by this choice defined by

  • The prior probability pw(z)

  • The likelihood (conditional probability) pw(X|z)

  • The posterior probability pw(z|X)

Therefore, once we have obtained the correct parameters \( \hat{w} \) by training the VAE, we can produce a new output X′ by sampling a z(i) from the prior probability \( {p}_{\hat{w}}\left(\mathbf{z}\right) \) and then generate the example from the conditional probability through \( {X}^{(i)}={p}_{\hat{w}}\left(X|\mathbf{z}={\mathbf{z}}^{(i)}\right) \).

Obtaining the optimal parameters, however, is not possible directly. The optimal parameters we are searching for are those that maximize the probability that the generated example X′ looks real. This probability can be rewritten as the aggregated conditional probabilities:

$$ {p}_w\left({X}^{(i)}\right)=\int {p}_w\left({X}^{(i)}|\mathbf{z}\right){p}_w\left(\mathbf{z}\right)d\mathbf{z}. $$

This, however, does not make the search any easier since we need to enumerate and sum up all z. Therefore, an approximation is made through a surrogate distribution, parameterized by another set of parameters, qv. Weng [15] shows in her explanation of the VAE the graphical model highlighting how qv is a stand-in for the unknown searched pw (see Fig. 9).

Fig. 9

The graphical model of the variational autoencoder. In a VAE, the variational decoder is pw(X|z), while the variational encoder is qv(z|X) (Figure after [15])

The reason to introduce this surrogate distribution actually comes from our wish to train neural networks for the encoding/decoding functions, and this requires us to back-propagate through the random variable z, which of course cannot be done. Instead, if we have control over the distribution, we can select it such that the reparameterization trick can be employed. We define qv to be a multivariate Gaussian distribution with learnable means and covariance matrix and a stochastic element multiplied with the covariance for sampling [15, 16]. With this, we can back-propagate through the sampling process.
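
A minimal PyTorch sketch of the reparameterization trick; the encoder that would produce mu and logvar is omitted, and all shapes are hypothetical:

```python
import torch

# The (hypothetical) encoder predicts a mean and a log-variance per latent dimension;
# sampling is rewritten as a deterministic function of (mu, logvar) plus external
# noise eps, so gradients can flow through mu and logvar.
def reparameterize(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    std = torch.exp(0.5 * logvar)      # standard deviation from log-variance
    eps = torch.randn_like(std)        # noise sampled outside the computation graph
    return mu + eps * std              # differentiable w.r.t. mu and logvar

mu = torch.zeros(8, requires_grad=True)        # toy 8-dimensional latent code
logvar = torch.zeros(8, requires_grad=True)
z = reparameterize(mu, logvar)
z.sum().backward()                             # gradients reach mu and logvar
```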

At this point, the two distributions need to be made to match: qv should be as similar to pw as possible. Measuring their similarity can be done in a variety of ways, of which the Kullback-Leibler divergence (KL divergence or KLD) is one.

3.2.2 KL Divergence

A divergence can be thought of as a distance-like function between two probability distributions, P and Q, measuring the similarity between them. Unlike a true distance, it is not symmetric, which means it will not yield the same value if measured from P to Q as the other way around:

$$ {D}_{\mathrm{KL}}\left(P\parallel Q\right)\ne {D}_{\mathrm{KL}}\left(Q\parallel P\right) $$

This can be seen when looking at the definition of KL divergence:

$$ {D}_{\mathrm{KL}}\left(P\parallel Q\right)=\sum \limits_xP(x)\log \frac{P(x)}{Q(x)} $$
(2)

Sometimes, the measure DKL is also called the relative entropy or information gain of P over Q, which also indicates the asymmetry.

To give the two distributions more meaning, let us associate them with a use case. P is usually the probability distribution of the example data, which can be the real images we wish to model, and is assumed to be unknown and high-dimensional. Q, on the other hand, is the modeled distribution, for example, parameterized by θ, similar to Eq. 1. Hence, Q is the distribution we can play with (in our case, optimize its parameters) to make it more similar to P. This means Q will become more informative with respect to the true P as we approach the optimal parameters.

Box 3: Example: Calculating DKL

When comparing the two distributions given in Fig. 10, the calculation of the Kullback-Leibler divergence, DKL, can explicitly be given by reading off the y values of the nine elements (columns) from Fig. 11 and inserting them into Eq. 2.

Fig. 10

Two distributions P and Q, here scaled to identical height

Fig. 11
Values of the two distributions as shown in Fig. 11: P = (0.02, 0.04, 0.09, 0.17, 0.35, 0.17, 0.09, 0.04, 0.02) and Q = (0.01, 0.12, 0.24, 0.22, 0.18, 0.12, 0.06, 0.03, 0.02).

The distributions P and Q, scaled to unit density, with added labels

The result of this calculation is for

$$ {\displaystyle \begin{array}{rll}{D}_{\mathrm{KL}}\left(P\parallel Q\right)& =\sum \limits_xP(x)\log \frac{P(x)}{Q(x)}& \\ {}& =0.02\ast \log \frac{.02}{.01}+0.04\ast \log \frac{.04}{.12}+\cdots +0.02\ast \log \frac{.02}{.022}\\ {}& =0.004-0.01+\cdots -0.0002\\ {}& =0.0801\end{array}} $$

which we call “forward KL” as it calculates in the direction from the actual distribution P to the model distribution Q and for

$$ {\displaystyle \begin{array}{rll}{D}_{\mathrm{KL}}\left(Q\parallel P\right)& =\sum \limits_xQ(x)\log \frac{Q(x)}{P(x)}& \\ {}& =0.01\ast \log \frac{0.01}{0.02}+0.12\ast \log \frac{0.12}{0.04}+\cdots +0.022\ast \log \frac{0.022}{0.02}\\ {}& =-0.002-0.05+\cdots +0.0002\\ {}& =0.0899\end{array}} $$

which we call “reverse KL.”
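
The same calculation can be scripted. A small sketch using the values read off Fig. 11 follows; exact results will differ slightly from the worked example above depending on rounding and the chosen base of the logarithm:

```python
import numpy as np

# Values as read off Fig. 11 (approximate)
P = np.array([0.02, 0.04, 0.09, 0.17, 0.35, 0.17, 0.09, 0.04, 0.02])
Q = np.array([0.01, 0.12, 0.24, 0.22, 0.18, 0.12, 0.06, 0.03, 0.02])

def kl(p, q):
    """D_KL(p || q) = sum_x p(x) * log(p(x) / q(x))."""
    return float(np.sum(p * np.log(p / q)))

print("forward KL, D_KL(P||Q):", kl(P, Q))
print("reverse KL, D_KL(Q||P):", kl(Q, P))
```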

Note that in the example in Box 3, there is both a nonzero P(X = xi) and a nonzero Q(X = xi) for each i ∈ {0, 1, …, 8}. This is crucial for KL divergence to work as a loss function.

3.2.3 Optimizing the KL Divergence

Examine what happens in forward and reverse KL if this condition is not satisfied for some i. If, in forward KL, P has nonzero values where Q has none (or extremely small values), the quotient inside the \( \log \) function will tend to infinity because of the division by (almost) zero, and the corresponding term will become very large.

In Fig. 12, we assume Qθ to be a unimodal normal distribution, i.e., a Gaussian, while P is any empirical distribution. In the left plots of the figure, we show a situation before minimizing the forward/reverse KL divergence between P and Qθ, in the right plots, the resulting shape of the Gaussian after minimization.

Fig. 12

The distributions P (solid) and Qθ (dashed) in the initial configuration (left) and after minimization (right), for forward KL DKL(P|Qθ) (top row) and reverse KL DKL(Qθ|P) (bottom row). The shaded regions mark where one distribution has probability mass while the other has (almost) none

When, in the minimization of forward KL DKL(P|Qθ), Qθ is zero where P has values greater than zero, KL goes to infinity in these regions (marked area in the start configuration of the top row in Fig. 12), since the denominator inside the \( \log \) function goes to zero. This, in turn, drives the parameters of Qθ to broaden the Gaussian to cover these areas, thereby removing the large loss contributions. This is known as the mean-seeking behavior of forward KL.

Conversely, in reverse KL (bottom row in Fig. 12), in the marked areas of the initial configuration, P is zero in regions where Qθ has values greater than zero. This yields large loss contributions, because the denominator inside the \( \log \) now vanishes, in this case driving the Gaussian to remove these areas from Qθ. Since we assumed a unimodal Gaussian Q, the minimization will focus on the largest mode of the unknown P. This is known as the mode-seeking behavior of reverse KL.

Forward KL tends to overestimate the target distribution, which is exaggerated in the right plot in Fig. 12. In contrast, reverse KL tends to underestimate the target distribution, for example, by dropping some of its modes. Since underestimation is the more desirable property in practical settings, reverse KL is the loss function of choice, for example, in variational autoencoders. The downside is that as soon as target distribution P and model distribution Qθ have no overlap, KL divergence evaluates to infinity and is therefore uninformative. One countermeasure to take is to add noise to Qθ, so that there is guaranteed overlap. This noise, however, is not desirable in the model distribution Qθ since it disturbs the generated output.
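
The two behaviors can also be reproduced numerically. The following sketch fits a single Gaussian Qθ to a bimodal target P on a 1D grid by minimizing either forward or reverse KL; all distributions and values are illustrative only:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Bimodal "true" distribution P on a 1D grid and a unimodal Gaussian model Q_theta
x = np.linspace(-10, 10, 2001)
dx = x[1] - x[0]
P = 0.7 * norm.pdf(x, loc=-3, scale=1.0) + 0.3 * norm.pdf(x, loc=4, scale=1.0)
P /= P.sum() * dx                                  # normalize numerically

def Q(theta):
    q = norm.pdf(x, loc=theta[0], scale=np.exp(theta[1]))
    return q / (q.sum() * dx)

eps = 1e-12                                        # avoid log(0) on the grid

def forward_kl(theta):                             # D_KL(P || Q_theta)
    return np.sum(P * np.log((P + eps) / (Q(theta) + eps))) * dx

def reverse_kl(theta):                             # D_KL(Q_theta || P)
    q = Q(theta)
    return np.sum(q * np.log((q + eps) / (P + eps))) * dx

theta0 = np.array([0.0, 0.0])                      # initial mu = 0, sigma = 1
mean_seeking = minimize(forward_kl, theta0, method="Nelder-Mead").x
mode_seeking = minimize(reverse_kl, theta0, method="Nelder-Mead").x
print("forward KL fit: mu=%.2f, sigma=%.2f" % (mean_seeking[0], np.exp(mean_seeking[1])))
print("reverse KL fit: mu=%.2f, sigma=%.2f" % (mode_seeking[0], np.exp(mode_seeking[1])))
```

The forward KL fit ends up as a broad Gaussian spanning both modes of P, while the reverse KL fit settles narrowly on one mode.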

Another way to remedy the problem of KL going to infinity is to adjust the calculation of the divergence, which is done in Jensen-Shannon divergence (JS divergence, DJS) defined as

$$ {D}_{\mathrm{JS}}=\frac{1}{2}\left({D}_{\mathrm{KL}}\left(P\parallel M\right)+{D}_{\mathrm{KL}}\left({Q}_{\theta}\parallel M\right)\right), $$
(3)

where \( M=\frac{P+{Q}_{\theta }}{2} \). In the case of nonoverlapping P and Qθ, this evaluates to the constant \( \log 2 \), which still does not provide information about the closeness of the two distributions but is computationally much friendlier and does not require the addition of a noise term to achieve numerical stability.
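
A small sketch contrasting the two divergences on two non-overlapping discrete distributions (a tiny ε is added to avoid division by zero on the computer):

```python
import numpy as np

def kl(p, q, eps=1e-12):
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def js(p, q):
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Two non-overlapping discrete distributions
P = np.array([0.5, 0.5, 0.0, 0.0])
Q = np.array([0.0, 0.0, 0.5, 0.5])

print(js(P, Q))   # ~log 2 = 0.693: finite, but constant no matter how "far apart" P and Q are
print(kl(P, Q))   # huge (infinite in theory): forward KL is uninformative here
```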

3.2.4 The Limits of VAE

In the VAE, reverse KL is used. Our optimization goal is maximizing the likelihood of producing realistic-looking examples, that is, ones with a high pw(x). Simultaneously, we want to minimize the difference between the approximate and true posterior distributions qv and pw. This can only be achieved through a reformulation of reverse KL [15]. After some rearranging of reverse KL, the loss of the variational autoencoder becomes

$$ {\displaystyle \begin{array}{rll}{L}_{\mathrm{VAE}}\left(w,v\right)& =-\log {p}_w(X)+{D}_{\mathrm{KL}}\left({q}_v\left(\mathbf{z}|X\right)\parallel {p}_w\left(\mathbf{z}|X\right)\right)& \\ {}& =-{\mathbbm{E}}_{\mathbf{z}\sim {q}_v\left(\mathbf{z}|X\right)}\log {p}_w\left(X|\mathbf{z}\right)+{D}_{\mathrm{KL}}\left({q}_v\left(\mathbf{z}|X\right)\parallel {p}_w\left(\mathbf{z}\right)\right)\end{array}} $$
(4)

\( \hat{w} \) and \( \hat{v} \) are the parameters minimizing this loss, which is the negative of the so-called evidence lower bound (ELBO).
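
A common concrete instantiation of Eq. 4, sketched in PyTorch under the usual assumptions of a Gaussian approximate posterior, a standard normal prior (so the KL term has a closed form), and an MSE reconstruction term; all shapes are hypothetical:

```python
import torch
import torch.nn.functional as F

def vae_loss(x_recon: torch.Tensor, x: torch.Tensor,
             mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    """Negative ELBO: reconstruction term plus D_KL(q_v(z|X) || p_w(z)) for a N(0, I) prior."""
    recon = F.mse_loss(x_recon, x, reduction="sum")                  # stands in for -E_q[log p_w(X|z)]
    kld = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())    # closed-form KL for Gaussians
    return recon + kld

# Toy usage with hypothetical shapes: a batch of 4 single-channel 28x28 images, 8 latent dims
x = torch.rand(4, 1, 28, 28)
x_recon = torch.rand(4, 1, 28, 28)
mu, logvar = torch.zeros(4, 8), torch.zeros(4, 8)
loss = vae_loss(x_recon, x, mu, logvar)
```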

We have seen how mode-seeking reverse KL divergence limits the generative capacity of variational autoencoders through the potential underrepresentation of all modes of the original distribution.

KL divergence and the ELBO-based objective also have a second fundamental downside: there is no way to find out how close our solution is to the obtainable optimum. We measure the similarity to the target distribution up to the KL divergence, but since the true \( {p}_{\hat{w}}(.) \) is unknown, the stopping criterion of the optimization has to be set by another means, e.g., a maximum number of iterations or an improvement of the loss below some ε.

The original presentation of the variational autoencoder was given as one example of the general framework called auto-encoding variational Bayes. This publication presented the above ideas in a thorough mathematical formulation, starting from a directed graphical model that poses the abstract problem. The authors also developed the seminal “reparameterization trick” to make the loss formulation differentiable and thereby make the search for the autoencoder parameters amenable to gradient descent optimizers [16]. The details are beyond the scope of this introductory treatment.

3.3 The Fundamental GAN Approach

At the core of the adversarial training paradigm is the idea to create two players competing in a minimax game. In such games, both players have access to the same variables but have opposing goals, so that they will manipulate the variables in different directions.

Referring to Fig. 13, we can see the generative part in orange color, where random numbers are drawn from the latent space and, one by one, converted into a set of “fake images” by the generator network, in the figure implemented by a CNN. Simultaneously, from a database of real images, a matching number of examples are randomly drawn. The real and fake images are composed into one batch of images which are fed into the discriminator. On the right side, the discriminator CNN is indicated in blue. It takes the batch of real and fake images and decides for each if it appears real (yielding a value close to “1”) or fake (“0”).

Fig. 13

Schematic of a GAN network. Generator (orange) creates fake images based on random numbers drawn from a latent space. These together with a random sample of real images are fed into the discriminator (blue, right). The discriminator looks at the batch of real/fake images and tries to assign the correct label (“0” for fake, “1” for real)

The error signal is computed from the number of correct assignments the discriminator can make on the batch of generated and real images. Both the generator and the discriminator can then update their parameters based on this same error signal. Crucially, the generator has the aim to maximize the error, since this signifies that it has successfully fooled the discriminator into taking the fake images for real, while the discriminator weights are updated to minimize the same error, indicating its success in telling true and fake examples apart. This is the core of the competitive game between generator and discriminator.

Let us introduce some abbreviations to designate GAN components. We will denote the generator and discriminator networks with G and D, respectively. The objective of GAN training is a game between generator and discriminator, where both affect a common loss function J, but in opposed directions. Formally, this can be written as

$$ \underset{G}{\min}\underset{D}{\max }J\left(G,D\right), $$

with the GAN objective function

$$ J\left(G,D\right)={\mathbbm{E}}_{x\sim {p}_{\mathrm{data}}}\left[\log D(x)\right]+{\mathbbm{E}}_{z\sim {p}_z}\left[\log \left(1-D\left(G(z)\right)\right)\right] $$
(5)

D will attempt to maximize J by maximizing the probability to assign the correct labels to real and generated examples: this is the case if D(x) = 1, maximizing the first loss component, and if D(G(z)) = 0, maximizing the second loss component. The generator G, instead, will attempt to generate realistic examples that the discriminator labels with “1,” which corresponds to a minimization of \( \log \left(1-D\left(G(z)\right)\right) \).
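
The minimax game can be sketched in a compact training loop. The toy example below uses 2D data and tiny MLPs; networks, hyperparameters, and the “real” data distribution are illustrative placeholders, and the generator update already uses the non-saturating loss discussed in Subheading 3.4:

```python
import torch
import torch.nn as nn

# Toy GAN on 2D data: all architectures, hyperparameters, and the "real" data
# distribution below are illustrative assumptions, not a reference implementation.
latent_dim, data_dim, batch = 16, 2, 128

G = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 64), nn.LeakyReLU(0.2), nn.Linear(64, 1), nn.Sigmoid())

opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

real_label = torch.ones(batch, 1)
fake_label = torch.zeros(batch, 1)

for step in range(1000):
    # Discriminator update: maximize log D(x) + log(1 - D(G(z)))
    x_real = torch.randn(batch, data_dim) * 0.5 + 3.0   # stand-in for samples from p_data
    z = torch.randn(batch, latent_dim)                  # z drawn from the latent prior
    x_fake = G(z).detach()                              # no generator gradients in this step
    loss_D = bce(D(x_real), real_label) + bce(D(x_fake), fake_label)
    opt_D.zero_grad()
    loss_D.backward()
    opt_D.step()

    # Generator update (non-saturating form, see Subheading 3.4): maximize log D(G(z))
    z = torch.randn(batch, latent_dim)
    loss_G = bce(D(G(z)), real_label)
    opt_G.zero_grad()
    loss_G.backward()
    opt_G.step()
```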

3.4 Why Early GANs Were Hard to Train

GANs with this training objective implicitly use JS divergence for the loss, which can be seen by examining the GAN training objective. Consider the discriminator for a fixed generator. Its loss is minimized by the optimal discriminator given by [1]

$$ \hat{D}(x)=\frac{p_{\mathrm{data}}(x)}{p_{\mathrm{data}}(x)+{p}_G(x)}. $$
(6)

Substituting \( \hat{D} \) in Eq. 5 yields (without proof) the implicit use of the Jensen-Shannon divergence if the above training objective is employed:

$$ J\left(G,\hat{D}\right)=2{D}_{\mathrm{JS}}\left({p}_{\mathrm{data}}\parallel {p}_G\right)-\log 4. $$
(7)

This theoretical result shows that a minimum in the GAN training can be found when the Jensen-Shannon divergence is zero. This is achieved for identical probability distributions pdata and pG or, equivalently, when the generator perfectly matches the data distribution [17].

Unfortunately, it also shows that this loss is, like KL divergence, only helpful when target distribution (i.e., data distribution) and model distribution have overlapping support. Therefore, added noise can be required to approximate the target distribution. In addition, the training criterion saturates if the discriminator in the early phase of training perfectly distinguishes between fake and real examples. The generator will therefore no longer obtain a helpful gradient to update its weights. An approach thought to prevent this was proposed by Goodfellow et al. [1]. The generator loss was turned from the minimization problem into a maximization problem that has the same fixed point in the overall minimax game but prevents saturation: instead of minimizing \( \log \left(1-D\left(G(z)\right)\right) \), one maximizes \( \log \left(D\left(G(z)\right)\right) \) [1].
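
The effect of this reformulation can be seen directly in the gradients. The tiny sketch below looks at the gradient with respect to the discriminator logit for a single generated sample that the discriminator confidently rejects; the numbers are illustrative:

```python
import torch

# Discriminator logit s for one generated sample; D(G(z)) = sigmoid(s).
# Early in training the discriminator confidently rejects the sample, so s is very negative.
s = torch.tensor(-8.0, requires_grad=True)

loss_saturating = torch.log(1.0 - torch.sigmoid(s))   # original generator objective (to be minimized)
loss_saturating.backward()
print(s.grad)          # ~ -3e-4: almost no gradient reaches the generator

s.grad = None
loss_non_saturating = -torch.log(torch.sigmoid(s))    # reformulation: maximize log D(G(z))
loss_non_saturating.backward()
print(s.grad)          # ~ -1: a useful training signal remains
```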

3.5 Improving GANs

GAN training has quickly become notorious for the difficulties it poses for researchers attempting to apply the mechanism to real-world problems. We have qualitatively attributed a part of these problems to the inherently difficult task of density estimation and motivated the intuition that while fewer samples might suffice to learn a decision boundary in a discriminative task, many more examples are required to build a powerful generative model.

In the following, some more light shall be shed on the reasons why GAN training might fail. Typical GAN problems comprise the following:

Mode dropping :

is the phenomenon in forward KL caused by regions of the data distribution not being covered by the generator distribution, which implies large probabilities of samples coming from Pdata and very small probabilities of originating from PG. This drives forward KL toward infinity and punishes the generator for not covering the entire data distribution [18]. If all modes but one are dropped, one can call this mode collapse: the generator only generates examples from one mode of the distribution.

Poor convergence :

can be caused by a discriminator learning to distinguish real and fake examples very early, which is very likely to happen in GAN training. This is rooted in the observation that, because the generative process projects from a low-dimensional latent space into the high-dimensional pG, the samples in pG are not close to each other but rather inhabit “islands” [18]. The discriminator can learn to find them and thereby differentiate between true and false samples easily, which causes the gradients driving generator optimization to vanish [17].

Poor sample quality :

despite a high log likelihood of the model is a consequence of the practical independence of sample quality and model log likelihood. Theis et al. [19] show that a high log likelihood does not imply generated sample fidelity, nor do visually pleasing samples imply a high log likelihood. Therefore, training a GAN with a loss function that effectively maximizes a log likelihood term is not an ideal choice, yet this is exactly what KL minimization corresponds to.

Unstable training :

is a consequence of reformulating the generator loss into maximizing \( \log D\left(G(z)\right) \). It can be shown [18] that this choice effectively makes the generator struggle between a reverse KL divergence favoring mode-seeking behavior and a negative JS divergence that actually drives the generator toward examples different from the real data distribution.

Many subsequent authors have touched on these topics, but already Arjovsky and Bottou [18] showed best practices for overcoming these problems.

Among the solutions proposed for GAN improvements are some that prevent the generator from producing only too similar samples in one batch, some that keep the discriminator insecure about the true labels of real and fake examples, and more, which Creswell et al. [17] have summarized in their GAN overview. A collection of best practices compiled from these sources is presented in Box 4. It is almost impossible to write a cookbook for successful, converging, stable GAN training. For almost every tip, there is a caveat or situation where it cannot be applied. The suggestions below therefore are to be taken with a grain of salt but have been used by many authors successfully.

Box 4: Best Practices for Stable GAN Training

General measures. GAN training is sensitive to hyperparameters, most importantly the learning rate. Mode collapse might already be mitigated by a lower learning rate. Also, different learning rates for generator and discriminator might help. Other typical measures are batch normalization (or instance normalization in case of small batch sizes; mind however that batch normalization can taint the randomness of latent vector sampling and in general should not be used in combination with certain GAN loss functions), use of transposed convolutions instead of parameter-free upsampling, and strided convolutions instead of down-sampling.

Feature matching. One typical observation is that neither discriminator nor generator converges. They play their “cat-and-mouse” game too effectively. The generator produces a good image, but the discriminator learns to figure it out, and the generator shifts to another good image, and so on.

A remedy for this is feature matching, where the ℓ2 distance between the average feature vectors of real and fake examples is computed instead of a cross-entropy loss on the logits. Because the feature vectors change slightly from batch to batch, this introduces randomness that helps to prevent discriminator overconfidence.
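A minimal sketch of such a feature matching term, written in PyTorch and assuming the intermediate discriminator activations for a batch of real and a batch of generated samples have already been extracted (a hypothetical setup, not a specific author's implementation):

```python
import torch

def feature_matching_loss(features_real, features_fake):
    """l2 distance between the batch-average discriminator features of real
    and generated samples. Both inputs are tensors of shape
    (batch, feature_dim), e.g., activations of an intermediate layer."""
    mean_real = features_real.mean(dim=0)
    mean_fake = features_fake.mean(dim=0)
    return torch.sum((mean_real - mean_fake) ** 2)
```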

Minibatch discrimination. When the generator only produces very convincing but extremely similar images, this is an indication for mode collapse.

This can be counteracted by calculating a similarity metric between generated samples and penalizing the generator for too little variation. Minibatch discrimination is considered to be superior in performance to feature matching.

One-sided label smoothing. Deep classification models often suffer from overconfidence, focusing on only very few features to classify an image. If this happens in a GAN, the generator might figure this out and only produce the feature the discriminator uses to decide for a real example.

A simple measure to counteract this is to provide not a “1” as a label for the real images in the batch but a lower value. This way, the discriminator is penalized for overconfidence (when it returns a value close to “1”).
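A hedged sketch of how one-sided label smoothing could look in a binary cross-entropy discriminator loss; the smoothed target of 0.9 is an illustrative choice, not a prescribed value:

```python
import torch
import torch.nn.functional as F

def discriminator_loss_with_smoothing(logits_real, logits_fake, real_label=0.9):
    """BCE discriminator loss where the target for real samples is lowered
    from 1.0 to, e.g., 0.9 (one-sided label smoothing); fake targets stay 0."""
    loss_real = F.binary_cross_entropy_with_logits(
        logits_real, torch.full_like(logits_real, real_label))
    loss_fake = F.binary_cross_entropy_with_logits(
        logits_fake, torch.zeros_like(logits_fake))
    return loss_real + loss_fake
```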

Cost function selection. Several sources list possible GAN cost functions. Randomly trying them one by one might work, but often some of the above measures, in particular learning rate and hyperparameter tuning, might be more successful first steps.

Besides these methods, one area of discussion concerned the question of whether there is a need to balance discriminator and generator learning and convergence at all. The argument was that a converged discriminator yields just as useful a training signal for the generator as a non-converged one. Practically, however, many authors described carefully designed update schedules, e.g., updating the generator once per a given number of discriminator updates.

Many more ideas exist: weight updating in the generator using an exponential moving average of previous weights to avoid “forgetting,” different regularization and conditioning techniques, and injecting randomness into generator layers anew. Some we will encounter later, as they have proven to be useful in more recent GAN architectures.

Despite the recent advances in stabilizing GAN training, even the basic method described so far, with the improvements made in the seminal DCGAN publication [20], finds application to this day, e.g., for the de novo generation of PET color images [21]. The usefulness of an approach as presented in their publication might be doubted, since the native PET data is obviously not colored. The authors use 2D histograms of the three color-channel combinations to compare true and fake examples. As we have discussed earlier, this is likely a poor metric, since it does not allow insights into the high-dimensional joint probability distribution underlying the data-generating process. Figure 14 shows some generated examples alongside original PET images.

Fig. 14
Six P E T images of the brain are arranged in two rows. Row one contains three real images, each of which is an irregular shape with an oval border on a dark background. Row two contains three synthetic images, each of which is an irregular shape with an oval border on a dark background.

PET images generated from random noise using a DCGAN architecture. Image taken from [21] (CC-BY4.0)

To address many of the GAN training dilemmas, Arjovsky and Bottou [18] proposed to employ the Wasserstein distance as a replacement for the KL or JS divergence already in their examination of the root causes of poor GAN training results and later extended this into the widely received approach we will focus on next [22, 23]. We will also see more involved and recent approaches to stabilize and speed up GAN training in later sections of this chapter (Subheading 4).

3.6 Wasserstein GANs

Wasserstein GANs appealed to the deep learning and GAN community very quickly after Arjovsky et al.’s [22] seminal publication because of a number of traits their inventors claimed they would have. For one, Wasserstein GANs are based on the theoretical argument that changing the loss function to the Wasserstein distance should lead to improved results. This, combined with the reported benchmark performance, would already justify attention. But Wasserstein GANs were additionally reported to train much more stably because, as opposed to previous GANs, the discriminator is trained to convergence in every iteration, instead of demanding a carefully and heuristically found update schedule for generator and discriminator. In addition, the loss was reported to correlate directly with the visual quality of generated results, instead of being essentially meaningless in a minimax game.

Wasserstein GANs are therefore worth an in-depth treatment in the following sections.

3.6.1 The Wasserstein (Earthmover) Distance

The Wasserstein distance figuratively measures how, with an optimal transport plan, mass can be moved from one configuration to another configuration with minimal work. Think, for example, of heaps of earth. Figure 15 shows two heaps of earth, P and Q (discrete probability distributions), both containing the same amount of earth in total, but in different concrete states x and y out of all possible states.

Fig. 15
Six discrete probability distributions of P and Q. The 3 P distributions on the left, with piles p1 to p5, have 1, 2, 3, 4, and 5 squares, respectively. The 3 Q distributions are on the right, with q4 squares in the middle and bottom with 1 and 2 dark shaded squares, and q3 and q1 squares with 1 and 2 blank squares.

One square is one shovel full of earth. Transporting the earth shovel-wise from pile to pile amasses performed work: the Wasserstein (earthmover) distance. The example shows a Wasserstein distance of DW = 7

Work is defined as the number of shovelfuls of earth times the distance they are moved. In the three rows of the figure, earth is moved (only within one of P or Q, not from one to the other) in order to make the configurations identical. First, one shovelful of earth is moved one pile further, which adds one to the Wasserstein distance. Then, two shovelfuls are moved three piles each, adding six and yielding the final Wasserstein distance of DW = 7.

Note that in an alternative plan, it would have been possible to move two shovelfuls of earth from p4 to p1 (costing six) and one from p4 to p3, which is the inverse transport plan of the above, executed on P, and leading to the same Wasserstein distance. The Wasserstein distance is in fact a distance, not a divergence, because it yields the same result regardless of the direction. Also note that we implicitly assumed that P and Q share their support,Footnote 3 but in the case of disjoint support, only an additional term would have to be added, which grows with the distance between the support regions.

Many other transport plans are possible, and some can be equally cheap (or even cheaper; it is left to the reader to try this out). Transport plans need not modify only one of the stocks but can modify both to reach the optimal strategy of making them identical. Algorithmically, the optimal transport plan can be found by formulating the problem as a linear program. However, enumerating all transport plans and solving the linear program are intractable for larger and more complex “heaps of earth.” Any nontrivial GAN needs to estimate transport between such complex “heaps” and therefore suffers from this intractability. Consequently, in practice, a different approach must be taken, which we will sketch below.Footnote 4
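As a toy illustration of the linear-programming view, the following sketch solves the transport problem with SciPy's linprog for made-up pile configurations (chosen so that the optimal cost happens to be 7, as in the example above). It is purely illustrative; as discussed, this approach does not scale to image-sized distributions.

```python
import numpy as np
from scipy.optimize import linprog

def earthmover_lp(p, q, positions):
    """Optimal transport cost between two discrete 'heaps' p and q of equal
    total mass located at the given pile positions, solved as a linear program
    over the transport plan gamma (flattened row-major)."""
    n = len(p)
    cost = np.abs(np.subtract.outer(positions, positions)).ravel()  # |x - y|
    A_eq = np.zeros((2 * n, n * n))
    for i in range(n):
        A_eq[i, i * n:(i + 1) * n] = 1.0   # row sums of gamma equal p
        A_eq[n + i, i::n] = 1.0            # column sums of gamma equal q
    b_eq = np.concatenate([p, q])
    res = linprog(cost, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.fun

p = np.array([0.0, 0.0, 1.0, 3.0])   # shovelfuls per pile (made-up numbers)
q = np.array([2.0, 0.0, 2.0, 0.0])
print(earthmover_lp(p, q, positions=np.arange(4.0)))   # prints 7.0
```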

Formalizing the search for the optimal transport plan, we look at all possible joint distributions of our P and Q, forming the set of all possible transport plans, and denote this set Π(P, Q), implying that for all γ ∈ Π(P, Q), P and Q are the marginal distributions of γ.Footnote 5 This, in turn, means that by definition ∑yγ(x, y) = P(x) and ∑xγ(x, y) = Q(y).

For one concrete transport plan γ, the value γ(x, y) is the amount of earth moved from a state x in P to a state y in Q; we are interested in the optimal such plan. Let ∥x − y∥ be the Euclidean distance over which earth is shifted between x and y. Multiplying it with every value of γ (the amount of earth shifted) and minimizing over all plans leads to

$$ {D}_{\mathrm{W}}\left(P,Q\right)=\underset{\gamma \in \Pi}{\operatorname{inf}}\kern0.3em \sum \limits_{x,y}\parallel x-y\parallel \gamma \left(x,y\right), $$

which can be rewritten to obtain

$$ {D}_{\mathrm{W}}\left(P,Q\right)=\underset{\gamma \in \Pi \left(P,Q\right)}{\operatorname{inf}}{\mathbbm{E}}_{\left(x,y\right)\sim \gamma}\parallel x-y\parallel . $$
(8)

It measures both the distance between two distributions with disjoint support and the difference between distributions with perfectly overlapping support, because it accounts for both the amount of earth shifted and the distance over which it is moved.

Practically, though, this result cannot be used directly, since the linear programming problem scales exponentially with the number of dimensions of the domain of P and Q, which is high for images. In addition, we need to differentiate the distance function if we want to use it for training a deep neural network with backpropagation. However, we cannot obtain a derivative from the distance function in the given form, since, in the linear programming (LP) formulation, the optimized distribution (as well as the target distribution) ends up in the constraints, not in the parameters.

Fortunately, we are not interested in the transport plan γ itself, but only in the distance (of the optimal transport plan). We can therefore use the dual form of the LP problem, in which the constraints of the primal form become parameters. With some clever definitions, the problem can be cast into the dual form, finally yielding

$$ {D}_{\mathrm{W}}\left(P,Q\right)=\underset{\parallel f{\parallel}_{L\le 1}}{\sup}\kern0.3em {\mathbbm{E}}_{x\sim P}f(x)-{\mathbbm{E}}_{x\sim Q}f(x) $$

with a function f that has to adhere to a constraint called 1-Lipschitz continuity, which requires f to have a slope of magnitude at most 1 everywhere. f is implemented as a neural network; in the GAN setting, this is the discriminator (in WGAN terminology: critic) network. 1-Lipschitz continuity can be enforced in a simple, if crude, way by clipping the weights to a very small interval around 0.

3.6.2 Implementing WGANs

To implement the distance as a loss function, we rewrite the last result again as

$$ {D}_{\mathrm{W}}\left(P,Q\right)=\underset{w\in W}{\max }{\mathbbm{E}}_{x\sim P}\left[{D}_w(x)\right]-{\mathbbm{E}}_{z\sim {p}_z}\left[{D}_w\left({G}_{\theta }(z)\right)\right]. $$
(9)

Note that, in contrast to the other GAN losses we have seen before, there is no logarithm anymore, because this time the “discriminator” is no longer a classification network that should learn to discriminate true from fake samples but rather serves as a “blank” helper function (the critic) that during training learns to estimate the Wasserstein distance between the sets of true and fake samples.
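A minimal PyTorch sketch of one critic update implementing Eq. 9, with weight clipping as the crude Lipschitz enforcement discussed above. The `critic`, `generator`, and optimizer objects as well as the clipping value are assumptions supplied by the caller; this is an illustration of the loss structure, not a reproduction of the original WGAN code.

```python
import torch

def wgan_critic_step(critic, generator, opt, real_batch, z_dim, clip=0.01):
    """One WGAN critic update: maximize E[D(x)] - E[D(G(z))], implemented as
    minimizing the negative, followed by weight clipping."""
    z = torch.randn(real_batch.size(0), z_dim)
    fake_batch = generator(z).detach()          # no generator gradients here
    loss = -(critic(real_batch).mean() - critic(fake_batch).mean())
    opt.zero_grad()
    loss.backward()
    opt.step()
    for p in critic.parameters():               # crude Lipschitz constraint
        p.data.clamp_(-clip, clip)
    return loss.item()

def wgan_generator_loss(critic, generator, batch_size, z_dim):
    """Generator update signal: minimize -E[D(G(z))]."""
    z = torch.randn(batch_size, z_dim)
    return -critic(generator(z)).mean()
```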

Box 5: Spectral Normalization

Spectral normalization is applied to the weight matrices of a neural network to ensure boundedness of the error function (e.g., Lipschitz continuity of the discriminator network in the WGAN context). This helps convergence like any other normalization method, as it provides a guarantee that gradient directions are stable around the current point, allowing larger step widths.

The spectral norm (or matrix norm) measures how far a matrix A can stretch a vector x:

$$ \left\Vert \boldsymbol{A}\right\Vert =\underset{\boldsymbol{x}\ne 0}{\max}\frac{\left\Vert \boldsymbol{A}\boldsymbol{x}\right\Vert }{\left\Vert \boldsymbol{x}\right\Vert } $$

The numerical value of the spectral norm of A can be shown to be just its maximum singular value. To compute the maximum singular value, an algorithmic idea helps: the power iteration method, which yields the dominant eigenvector (applied to ATA, this gives the leading singular direction of A).

Power iteration uses the fact that repeated multiplication with a matrix rotates a (generic) random vector toward the eigenvector with the largest absolute eigenvalue. Therefore, by iteratively calculating \( \frac{\boldsymbol{A}\boldsymbol{x}}{\parallel \boldsymbol{A}\boldsymbol{x}\parallel } \), the dominant eigenvector is obtained eventually.

In practice, it is observed that a single iteration is already sufficient to achieve the desired normalizing behavior.
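A hedged sketch of a single power-iteration step for estimating the spectral norm of a weight matrix, keeping a persistent singular vector estimate between training steps; in practice one would typically rely on a library utility such as torch.nn.utils.spectral_norm rather than this illustrative function:

```python
import torch
import torch.nn.functional as F

def spectral_norm_estimate(weight, u, n_iters=1, eps=1e-12):
    """Estimate the largest singular value (spectral norm) of a 2-D weight
    tensor by power iteration; `u` is the persistent left singular vector
    estimate. One iteration per training step is usually sufficient."""
    w = weight.reshape(weight.size(0), -1)
    for _ in range(n_iters):
        v = F.normalize(w.t() @ u, dim=0, eps=eps)   # right singular direction
        u = F.normalize(w @ v, dim=0, eps=eps)       # left singular direction
    sigma = torch.dot(u, w @ v)                      # spectral norm estimate
    return sigma, u

# Usage sketch: divide the weight by sigma before the forward pass so that
# the layer's spectral norm is approximately 1:
#   w_normalized = weight / sigma
```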

Consequently, the key ingredient is the Lipschitz constraint on the discriminator network,Footnote 6 and the question of how to enforce it in a stable and regularized way. It soon turned out that weight clipping is not an ideal choice. Instead, two other methods have been proposed: the gradient penalty approach and normalizing the weights with the spectral norm of the weight matrices.

Both have been added to the standard catalogue of performance-boosting measures in GAN training ever since, where in particular spectral normalization (cf. Box 5) is attractive as it can be implemented very efficiently, has a sound theoretical and mathematical foundation, and ensures stable and efficient training.

3.6.3 Example Application: Brain Abnormality Detection Using WGAN

One of the first applications of Wasserstein GANs in a practical use case was presented in the medical domain, specifically in the context of attributing the visible changes between a diseased patient and a normal control to locations in the images [27]. The detection problem was cast into a GAN approach (and then solved with a Wasserstein GAN) by delineating the regions that make the image of a diseased patient look “diseased,” i.e., finding the residual region that, if subtracted from the diseased-looking image, would make it look “normal.”

Figure 16 shows the construction of the VA-GAN architecture with images from a mocked dataset for illustration. For the authors’ results, see their publication and code repository.Footnote 7

Fig. 16
A schematic of V A-G A N architecture. A diseased computed tomography scan leads to another scan via generator U-net, then the diseased scan and feature attribution result in a modified diseased scan and control, critic, and W G A N loss.

An image of a diseased patient is run through a U-Net with the goal to yield a map that, if added to the input image, results in a modified image that fools the discriminator (“critic”) network into classifying it as a “normal” control. The map can be interpreted as the regions attributed to appear abnormal, giving rise to the name of the architecture: visual attribution GAN (VA-GAN)

For their implementation, the authors note that neither batch normalization nor layer normalization helped convergence and hypothesize that the difference between real and generated examples may be a reason why batch normalization in particular can have an adverse effect, especially during the early training phase. Instead, they impose an ℓ1 norm loss component on the U-Net-generated “visual (feature) attribution” (VA) map to ensure that it constitutes a minimal change to the subject. This serves to prevent the generator from changing the subject into some “average normal” image that it may otherwise learn. They employ an update regime that trains the critic network for more iterations than the generator, but doesn’t train it to convergence as proposed in the original WGAN publications. Apart from these measures, in their code repository, the authors give several practical hints and heuristics that may stabilize the training, e.g., using a tanh activation for the generator, exploring different dropout settings, and in general using a large enough dataset. They also point out that the Wasserstein distance isn’t suited for model selection, since it is too unstable and not directly correlated with the actual usefulness of the trained model.

This is one more reason to turn in the next section to an important topic in the context of validation for generative models: How to quantify their results?

3.7 GAN Performance Metrics

One pressing question has so far been postponed, though it implicitly plays a crucial role in the quest for “better” GANs: how to actually measure the success of a GAN, or its performance in terms of result quality?

GANs can be adapted to solve image analysis tasks like segmentation or detection (cf. Subheading 3.6.3). In such cases, the quality and success can be measured in terms of task-related performance (Jaccard/Dice coefficient for segmentation, overlap metrics for detection etc.).

Performance assessment is less trivial if the GAN is meant to generate unseen images from random vectors. In such scenarios, the intuitive criterion is how convincing the generated results are. But convincing to whom? One could expose human observers to the real and fake images, ask them to tell them apart, and call a GAN better than a competing GAN if it fools the observer more consistently.Footnote 8 Since this is practically infeasible, metrics were sought that provide a more objective assessment.

The most widely used way to assess GAN image quality is the Fréchet inception distance (FID). This distance is conceptually related to the Wasserstein distance and has an analytical solution for Gaussian (normal) distributions. In the multivariate case, the Fréchet distance between two distributions X and Y is given by the squared distance of their means μX (resp. μY) and a term depending on their covariance matrices ΣX (resp. ΣY):

$$ d\left(X,Y\right)={\left\Vert {\mu}_X-{\mu}_Y\right\Vert}^2+\mathrm{Tr}\left({\Sigma}_X+{\Sigma}_Y-2\sqrt{\Sigma_X{\Sigma}_Y}\right). $$
(10)

In practice, the FID score based on this distance is computed as follows:

  • Take two batches of images (real/fake, respectively).

  • Run them through a feature extraction or embedding model. For FID, the inception model is used, pretrained on ImageNet. Retain the embeddings for all examples.

  • Fit one multivariate normal distribution each to the embedded real and fake examples.

  • Calculate their Fréchet distance according to the analytical formula in Eq. 10.
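A minimal sketch of the score computation, assuming the embeddings have already been extracted (e.g., Inception pool features of the real and fake batches, here as hypothetical NumPy inputs):

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_real, feats_fake):
    """Fréchet distance (Eq. 10) between Gaussians fitted to two embedding
    sets of shape (n_samples, n_features)."""
    mu_r, mu_f = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_f = np.cov(feats_fake, rowvar=False)

    diff = mu_r - mu_f
    covmean, _ = linalg.sqrtm(cov_r @ cov_f, disp=False)  # matrix square root
    covmean = covmean.real  # discard tiny imaginary parts from numerics

    return diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean)
```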

This metric has a number of downsides. Typically, the score decreases when it is computed on a larger batch of images, although the same model is being evaluated. This bias can be remedied, yet FID nevertheless remains the most widely used metric. Also, if the inception network cannot capture the features of the data FID is applied to, the metric might simply be uninformative. This is obviously a grave concern in the medical domain, where imaging features look much different from natural images (although, on the other hand, transfer learning for medical classification problems has proven to work surprisingly well, so apparently convolutional filters trained on photographs also extract applicable features from medical images). In any case, the selection of the pretrained embedding model introduces a bias into the validation results. Lastly, the assumption of a multivariate normal distribution for the inception features might not be accurate, and describing them only through their means and covariances is a severe reduction of information. Therefore, a qualitative evaluation is still required.

One obvious additional question arises: If the ultimate metric to judge the quality of the generator is given by, for example, the FID, why can’t it be used as the optimization goal instead of minimizing a discriminator loss? In particular, as the Fréchet distance is a variant of the Wasserstein distance, an answer to this question is not obvious. In fact, feature matching as described in Box 4 exactly uses this type of idea, and likewise, it has been partially adopted in recent GAN architectures to enhance the stability of training with a more fine-grained loss component than a pure categorical cross-entropy loss on the “real/fake” classification of the discriminator.

Related recent research is concerned with the question of how generated results can be detected automatically, to counteract fraudulent authors. So-called forensic algorithms detect patterns that give away generated images. Proposed solutions, based on different analysis directions, encompass image fingerprinting and frequency-domain analysis [28,29,30,31].

4 Selected GAN Architectures You Should Know

In the following, we will examine some GAN architectures and GAN developments that were taken up by the medical community or that address specific needs that might make them appealing, e.g., for limited data scenarios.

4.1 Conditional GAN

GANs cannot be told what to produce—at least that was the case with early implementations. It was obvious, though, that a properly trained GAN would imprint the semantics of the domain onto its latent space, which was evidenced by experiments in which the latent space was traversed and images of certain characteristics could be produced by sampling accordingly. Also, it was found that certain dimensions of the latent space can correspond to certain features of the images, like hair color or glasses, so that modifying them alone can add or take away such visible traits.

With the improved conditional GAN of Isola et al. [32], which followed a number of GANs that modeled the conditioning input more explicitly, an approach was introduced that is based on a U-Net architecture as the generator and a discriminator network that values local style over a full-image assessment.

Technically, the formulation of a conditional GAN is straightforward. Recalling the value function (learning objective) of GANs from Eq. 5,

$$ J\left(G,D\right)={\mathbbm{E}}_{x\sim {p}_{\mathrm{data}}}\left[\log D(x)\right]+{\mathbbm{E}}_{z\sim {p}_G}\left[\log \left(1-D\left(G(z)\right)\right)\right], $$

we now want to condition the generation on some additional knowledge or input. Consequently, both the generator G and the discriminator D receive an additional “conditioning” input, which we call y. This can be a class label but also any other associated information. Very commonly, the additional input is an image, as, for example, in image translation applications (transforming one imaging modality into another, such as MRI to CT). The result is the cGAN objective function:

$$ {J}_{\mathrm{cGAN}}\left(G,D\right)={\mathbbm{E}}_{x\sim {p}_{\mathrm{data}}}\left[\log D\left(x|y\right)\right]+{\mathbbm{E}}_{z\sim {p}_G}\left[\log \left(1-D\left(G\left(z|y\right)|y\right)\right)\right] $$
(11)

Isola et al. [32] describe experiments with MNIST handwritten digits, where a simple generator with two layers of fully connected neurons was used, and similarly for the discriminator; the class label served as the conditioning input y. In a second experiment, a CNN creates a feature representation of images, and the generator is trained to generate textual labels (choosing from a vocabulary of about 250,000 encoded terms) for the images, conditioned on this feature representation.

Figure 17 shows a possible architecture for employing a cGAN for image-to-image translation. In this diagram, the conditioning input is the source image x from which the trained network shall produce the translated target image. The generator network therefore is a U-Net. The discriminator network can be implemented, for example, by a classification network. This network always receives two inputs: the conditioning image (x in Fig. 17) and either the generated output \( \hat{y} \) or the true paired image y.

Fig. 17
A schematic of c G A N. On top, the translated image of y circumflex equals G of x from conditioning image x through generator G, and at the bottom, the paired image y. The discriminator assigns values 0 or 1 in the top and bottom.

A possible architecture for a cGAN. Left: the generator network takes the base image x as input and generates a translated image \( \hat{y} \). The discriminator receives either this pair of images or a true pair x, y (right). The additional generator reconstruction loss (often an ℓ1 loss) is calculated between y and \( \hat{y} \)

Note that the work of Isola et al. [32] introduces an additional loss term on the generator that measures the ℓ1 distance between the generated and ground truth image, which is (with variables as in Eq. 11)

$$ {\mathcal{J}}_{\ell_1}(G)={\mathbbm{E}}_{\mathrm{x},\mathrm{y},\mathrm{z}}\parallel y-G\left(x,z\right){\parallel}_1, $$

where ∥⋅∥1 is the ℓ1 norm.

The authors do not further justify this loss term apart from stating that ℓ1 is preferred over ℓ2 to encourage less blurry results. It can be expected that this loss component provides a good training signal to the generator when the discriminator loss doesn’t, e.g., at the beginning of training, when there is little or no overlap between the target and parameterized distributions. The authors propose to give the ℓ1 loss orders of magnitude more weight than the discriminator loss component, to value accurate translations of images over “just” very plausible images in the target domain.
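A minimal sketch of such a combined generator objective in PyTorch: an adversarial term on the (conditioning image, generated image) pair plus a heavily weighted ℓ1 reconstruction term. The discriminator interface, tensor shapes, and the weight of 100 are illustrative assumptions, not the authors' exact setup.

```python
import torch
import torch.nn.functional as F

def cgan_generator_loss(discriminator, x, y, y_hat, lam=100.0):
    """Adversarial + weighted l1 generator loss for a pix2pix-style cGAN.
    x: conditioning image, y: ground truth target, y_hat: generator output."""
    logits_fake = discriminator(torch.cat([x, y_hat], dim=1))
    adv = F.binary_cross_entropy_with_logits(
        logits_fake, torch.ones_like(logits_fake))   # try to fool the discriminator
    l1 = F.l1_loss(y_hat, y)                          # stay close to the paired target
    return adv + lam * l1
```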

The cGAN, notably in the configuration with a U-Net serving as the generative network, was adopted very quickly by artists and scientists, thanks to the free implementation pix2pix.Footnote 9 One example created with pix2pix is given in Fig. 18, where the cGAN was trained to produce cat images from line drawings.

Fig. 18
A schematic of a pix 2 pix experiment. On the left, an input of a cat outline triggers the pix 2 pix process, which results in an output of a cat with hairs and facial features.

Input and output of a pix2pix experiment. Online demo at https://affinelayer.com/pixsrv/

One application in the medical domain was proposed, for example, by Senaras et al. [33]. The authors used a U-Net as a generator to produce a stained histopathology image from a label image that has two distinct labels for two kinds of cell nuclei. Here, the label image is the conditioning input to the network. Consequently, the discriminator network, a classification CNN tailored to the patch-based classification of slides, receives two inputs: the histopathology image and a label image.

Another example employed an augmented version of the conditional GAN to translate CT to MR images of the brain, including a localized uncertainty estimate about the image translation success. In this work, a Bayesian approach to model the uncertainty was taken by including dropout layers in the generator model [34].

Lastly, a 3D version of the pix2pix approach with a 3D U-Net as a generative network was devised to segment gliomas in multimodal brain MRI using data from the 2020 International Multimodal Brain Tumor Segmentation (BraTS) challenge [35]. The authors called their derived model vox2vox, alluding to the extension to 3D data [36].

More conditioning methods have been developed over the years, some of which will be sketched further on. Common to this type of GAN is that paired images are required to train the network.

4.2 CycleGAN

While cGANs require paired data for the gold standard and conditioning input, such pairs are often hard to come by, in particular in medical use cases. Therefore, the development of the CycleGAN set a milestone, as it removes this requirement and makes it possible to train image-to-image translation networks without paired input samples.

The basic idea in this architecture is to train two mapping functions between two domains and to execute them in sequence so that the resulting output is considered to be in the origin domain again. The output is compared against the original input, and their ℓ1 or ℓ2 distance establishes a novel addition to the otherwise usual adversarial GAN loss. This might conceptually remind one of the autoencoder objective: reproduce the input signal after encoding and decoding; only this time, there is no bottleneck but another interpretable image space. This can be exploited to stabilize the training, since the sequential concatenation of the image translation functions, which we will call G and F, can be reversed. Figure 19 shows a schematic of the overall process (left) and one incarnation of the cycle, here from image domain X to Y and back (middle).

Fig. 19
A schematic of the G A N cycle. On the left, x leads to y via G, y leads to x via F, and x and y give D x and D y, respectively. In the middle, x leads to G of the x tilde y hat via G, then to the x hat via F and D y. On the right, x has two blocks of j c y c of G, F leads to the a block of y.

Cycle GAN. Left: image translation functions G and F convert between two domains. Discriminators DX and DY give adversarial losses in both domains. Middle: for one concrete translation of an image x, the translation to Y  and back to X is depicted. Right: after the translation cycle, the original and back-translated result are compared in the cycle consistency loss

CycleGANs employ several loss terms in training: two adversarial losses \( \mathcal{J}\left(G,{D}_Y\right) \) and \( \mathcal{J}\left(F,{D}_X\right) \) and two cycle consistency losses, of which one, \( {\mathcal{J}}_{\mathrm{cyc}}\left(G,F\right) \), is indicated rightmost in Fig. 19. Zhu et al. [37] presented the initial publication, with the cGAN author Isola among the co-authors. The cycle consistency losses are ℓ1 losses in their implementation, and the GAN losses are least-squares losses instead of negative log-likelihood losses, since more stable training was observed with this choice.
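A minimal sketch of the cycle consistency term, assuming two hypothetical generator modules G (X to Y) and F_net (Y to X); the adversarial losses would be computed separately, as in any GAN:

```python
import torch
import torch.nn.functional as F

def cycle_consistency_loss(G, F_net, x, y):
    """Translate X -> Y -> X and Y -> X -> Y, penalizing the l1 distance
    between the back-translated results and the originals."""
    x_cycled = F_net(G(x))
    y_cycled = G(F_net(y))
    return F.l1_loss(x_cycled, x) + F.l1_loss(y_cycled, y)
```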

Almahairi et al. [38] provided an augmented version, noting that the original implementation suffers from the inability to generate stochastic results in the target domain Y but rather learns a one-to-one mapping between X and Y and vice versa. To alleviate this problem, the generators are conditioned on one latent space each for both directions, so that, for the same input x ∈ X, G will now produce multiple generated outputs in Y, depending on the sample from the auxiliary latent space (and similarly in reverse). Still, \( F \) has to recreate an \( \hat{x} \) minimizing the cycle consistency loss for each of these samples. This also remedies a second criticism brought forward against vanilla CycleGANs: these networks can learn to hide information in the (intermediate) target image domain that fools the discriminator but helps the backward generator to minimize the cycle consistency loss more efficiently [39]. Chu et al. [39] use adaptive histogram equalization to show that information is present in visually empty regions of the intermediate images. This finding is reminiscent of adversarial attacks, which the authors elaborate on in their publication.

Zhang et al. [40] show a medical application. In their work, a CycleGAN has been used to train image translation and segmentation models on unpaired images of the heart, acquired with MRI and CT and with gold standard expert segmentations available for both imaging datasets. The authors proposed to learn more powerful segmentation models by enriching both datasets with artificially generated data. To this end, MRIs are converted into CT contrast images and vice versa using GANs. Segmentation models for MRI and CT are then trained on datasets consisting of the original images with their expert segmentations, augmented by the converted images, for which expert segmentations can be carried over from their original domain. To achieve this, it is important that the converted (translated) images accurately depict the shape of the organs as expected in the target domain, which is enforced using the shape consistency loss.

In the extended setup of the CycleGAN with shape and cycle consistency, three different loss types instead of the original two are combined during training:

Adversarial GAN losses \( {\mathcal{J}}_{\mathrm{GAN}} \).:

This loss term is the same as defined, e.g., in Eq. 5.

Cycle consistency losses \( {\mathcal{J}}_{\mathrm{cyc}} \).:

This is the ℓ1 loss presented by the original CycleGAN authors discussed above.

Shape consistency losses \( {\mathcal{J}}_{\mathrm{shape}} \).:

The shape consistency loss is a new addition proposed by the authors. A cross-correlation loss takes into account two segmentations: the gold standard segmentation mx for an x ∈ X and the segmentation produced by a segmenter network S that was trained on domain Y and receives the translated image \( \hat{y}=G(x) \).

Figure 20 depicts the three loss components, of which the first two are known already from Fig. 19.

Fig. 20
A schematic of the G A N cycle. On the left, x leads to y, and y leads to x via G and F, x and y give D x and D y, then it leads to j G A N. In the middle, x consists of 2 blocks of j c y c of G, F leads to a block in y. On the right, x has a block of m x leads to a block in y via G, and y to a shape via s.

Cycle GAN with shape consistency loss (rightmost part of figure). Note that the figure shows only one direction to ease readability

Note that the description as well as Fig. 20 only show one direction for cycle and shape consistency loss. Both are duplicated into the other direction and combined into the overall training objective, which then consists of six components.

In several other works, the CycleGAN approach was extended and combined with domain adaptation methods for various segmentation tasks and also extended to volumetric data [41,42,43].

4.3 StyleGAN and Successor

One of the most powerful image synthesis GANs to date is the successor of StyleGAN, StyleGAN2 [44, 45]. The authors, at the time of writing researchers at Nvidia, deviate from the usual GAN approach in which an image is generated from a randomly sampled vector of a latent space. Instead, they use a latent space that is created by a mapping function f, implemented in their architecture as a multilayer perceptron, which maps from a 512-dimensional space Z into a 512-dimensional space W. The second major change is the so-called adaptive instance normalization layer, AdaIN, which normalizes each feature map to zero mean and unit variance and then applies a multiplicative factor and an additive bias term. This serves to reweight the importance of the feature maps in one layer. To ensure the locality of the reweighting, the operation is followed by the nonlinearity. The scaling and bias are the two components of y = (ys, yb), which is the result of a learnable affine transformation A applied to a sample from W.
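A minimal functional sketch of the AdaIN operation as described above; the shapes and the derivation of the style factors from a sample of W (via the learned transformation A) are assumed to be handled by the caller:

```python
import torch

def adain(feature_maps, y_scale, y_bias, eps=1e-5):
    """Adaptive instance normalization: per-sample, per-channel normalization
    to zero mean / unit variance, then rescaling and shifting with the
    style-derived factors. feature_maps: (batch, channels, H, W);
    y_scale, y_bias: (batch, channels)."""
    mean = feature_maps.mean(dim=(2, 3), keepdim=True)
    std = feature_maps.std(dim=(2, 3), keepdim=True) + eps
    normalized = (feature_maps - mean) / std
    return y_scale[:, :, None, None] * normalized + y_bias[:, :, None, None]
```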

In their experiments, Karras et al. [44] recognized that, after these changes, the GAN no longer depended on the input vector itself, so the random latent vector was replaced by a static (learned constant) input fed into the GAN. The y, which they call styles, are still derived from a vector randomly sampled from the new embedding space W.

Lastly, noise is added in each layer, which serves to allow the GAN to produce more variation without learning to produce it from actual image content. The noise, like the latent vector, is fed through learnable transformations B, before it is added to the unnormalized feature maps. The overall architecture is sketched in Fig. 21.

Fig. 21
A schematic of style G A N architecture. Latent z belongs to z, which leads to normalizing, followed by dense in F, w belongs to w, which leads to 4 blocks of A at the bottom, and noise leads to 4 blocks of B on the top. A, block leads to AdaI N via style of y s, y b, and B block is added to unnormalized feature maps.

StyleGAN architecture, after [44]. Learnable layers and transformations are shown in green, the AdaIN function in blue

In the basic setup, one sample is drawn from W and fed through the per-layer learned A to gain per-layer different interpretations of the style, y = (ys, yb). This can be changed, however: the authors show that, by using one random sample w1 in some of the layer blocks and another sample w2 in the remaining ones, the result will be a mixture of the styles of both individual samples. This way, the coarse attributes of the generated image can stem from one sample and the fine details from another. Applied to a face generator, for example, pose and shape of the face are determined in the coarse early layers of the network, while hair structure and skin texture are the fine details of the last layers. The architecture and results gained widespread attention through a website,Footnote 10 which recently was followed up by further similar pages. Results are depicted in Fig. 22.

Fig. 22
Five photographs. A person with a hat and eye glasses, an artwork with different shades, a cat with wide open eyes, a person on a horse, and a chemical structure against a dark background are presented from left to right.

Images created with StyleGAN; https://this{person—artwork—cat—horse—chemical}doesnotexist.com. Last accessed: 2022-01-14

The crucial finding in StyleGAN was that the mapping function f transforming the latent vector from Z to W serves to ensure a disentangled (flattened) latent space. Practically, this means that when interpolating points zi between two points z1 and z2 drawn from Z and reconstructing images from these interpolated points, semantic objects might appear (in a StyleGAN generating faces, for example, a hat or glasses) that are part neither of the image generated from the first point z1 nor of that generated from the second point z2 between which it has been interpolated. Conversely, when interpolating in W, this “semantic discontinuity” no longer occurs, as the authors show with experiments in which they measure the visual change of the resulting images when traversing both latent spaces.

In their follow-up publications, the same authors improve the performance even further. They keep the basic architecture but redesign the parts of the generative network pertaining to the AdaIN function. In addition, they add, as a regularizer, their metric from [44] that was meant to quantify the entanglement of the latent space. The discriminator network was also enhanced, and the mechanisms of StyleGAN that implement progressive growing have been successively replaced by more performance-efficient setups. In their experiments, they show a growth in visual and measured quality and the removal of several artifacts reported for StyleGAN [45].

4.4 Stabilized GAN for Few-Shot Learning

GAN training is very demanding, both regarding GPU power, in particular for high-performance architectures like StyleGAN and StyleGAN2, and, just as importantly, regarding the availability of data. StyleGAN2, for example, has typical training times of about 10 days on a system with 8 Nvidia Tesla V100 GPUs. The datasets comprise at least tens of thousands of images and easily orders of magnitude more. Particularly in the medical domain, such richness of data is typically hard to find.

The authors of [46] propose simple measures to stabilize the training of a specific GAN architecture, which they design from scratch using a replacement for residual blocks, arranged in an architecture with very few convolutional layers, and a loss that drives the discriminator to be less certain when it gets closer to convergence. In sum, this achieves very fast training and yields results competitive with prior GANs [46] and outperforming them in low-data situations.

The key ingredients to the architecture are shortcut connections in the generator model that rescale feature maps of higher resolution with learnable weights derived from low resolutions. The effect is to make fine details simultaneously more independent of direct predecessor feature maps and yet ensure consistency across scales.

A random seed vector of length 256 enters the first block (“Up Conv”), where it is upscaled to a 256 × 4 × 4 tensor. In Fig. 23, the further key blocks of the architecture are “upsample” and “SLE” blocks.

Upsample:

blocks consist of a nearest-neighbor upsampling followed by a 3 × 3 convolution, batch normalization, and nonlinearity.

Fig. 23
A schematic of a fast G A N generator network. A block of conv of 256 by 1 by 1, to 5 up samples through 4, 8, 16, 32, and 64 squares. Up sample leads to S L E via 128 squares, then up sample to S L E, and so on until it reaches conv 3 times 3 Tan h and outputs an image.

The FastGAN generator network. Shortcut connections through feature map weighting layers (called skip-layer excitation, SLE) transport information from low-resolution feature maps into high-resolution feature maps. For details regarding the blocks, see text

SLE:

blocks (seen in the top right inset in the architecture diagram) don’t touch the incoming high-resolution input (entering from top into the block) but comprise a pooling layer that in each SLE block is set up to yield a 4 × 4 stack of feature maps, followed by a convolution to reduce to a 1 × 1 tensor, which is then in a 1 × 1 convolution brought to the same number of channels as the high-resolution input. This vector is then multiplied to the channels of the high-resolution input.
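A simplified module following this textual description (adaptive pooling to 4 × 4, reduction to a 1 × 1 tensor, channel matching, and channel-wise gating of the high-resolution input); the choice of activation functions is an assumption for illustration, not the authors' exact configuration:

```python
import torch
import torch.nn as nn

class SkipLayerExcitation(nn.Module):
    """Simplified SLE block: derive per-channel weights from a low-resolution
    feature map and multiply them onto a high-resolution feature map."""
    def __init__(self, low_channels, high_channels):
        super().__init__()
        self.weights = nn.Sequential(
            nn.AdaptiveAvgPool2d(4),                      # -> (B, C_low, 4, 4)
            nn.Conv2d(low_channels, low_channels, 4),     # -> (B, C_low, 1, 1)
            nn.LeakyReLU(0.1),
            nn.Conv2d(low_channels, high_channels, 1),    # match high-res channels
            nn.Sigmoid(),
        )

    def forward(self, x_low, x_high):
        return x_high * self.weights(x_low)               # channel-wise gating
```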

Secondly, the architecture introduces a self-supervision feature in the discriminator network. The discriminator network (see Fig. 24) is a simple CNN with strided convolutions in each layer, halving resolution in each feature map. In the latest (coarsest) feature maps, simple up-scaling convolutional networks are attached that generate small images, which are then compared in loss functions (\( {\mathcal{J}}_{\mathrm{recon}} \) in Fig. 24) to down-sampled versions of the real input image. This self-supervision of the discriminator is only performed for real images, not for generated ones.

Fig. 24
A schematic of the fast G A N mechanism. An image of 1024 power 2 leads to a down sample and crop, then to an image of 128 power 2. An image with 1024 power 2 leads to the down conv block, then to the residual block with conv blocks A and B having 256 power 2, and finally to conv block c and output logits.

The FastGAN self-supervision mechanism of the discriminator network. Self-supervision manifests through the loss term indicated by the curly bracket between reconstructions from feature maps and resampled/cropped versions of the original real image, \( {\mathcal{J}}_{\mathrm{recon}} \)

The blocks in the figure spell out as follows:

Down Conv Block:

consists of two convolutional layers with strided 4 × 4 convolutions, effectively reducing the resolution from 1024² to 256².

Residual Blocks:

have two sub-items, “Conv Block A” being a strided 4 × 4 convolution to half resolution, followed by a padded 3 × 3 convolution. “Conv Block B” consists of a strided 2 × 2 average pooling that quarters resolution, followed by a 1 × 1 convolution, so that both blocks result in identically shaped tensors, which are then added.

Conv Block C:

consists of a 1 × 1 convolution followed by a 4 × 4 convolution without strides or padding, so that the incoming 8² feature map is reduced to 5².

Decoder:

The decoder networks are four blocks of upsampling layers each followed by 3 × 3 convolutions.

The losses employed in the model are the discriminator loss, consisting of the hinge version of the usual GAN loss with the added regularizing reconstruction loss between original real samples and their reconstructions, and the generator loss, plainly being \( {\mathcal{J}}_{\mathrm{G}}=-{\mathbbm{E}}_{z\sim Z}\left[D\left(G(z)\right)\right] \).
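A hedged sketch of these loss terms in PyTorch; the reconstruction term is assumed to be computed elsewhere and passed in, and the exact weighting is an illustrative simplification:

```python
import torch
import torch.nn.functional as F

def hinge_discriminator_loss(d_real, d_fake, recon_loss=0.0):
    """Hinge GAN loss on the critic outputs for real and generated samples,
    plus the (precomputed) self-supervised reconstruction term."""
    loss_real = F.relu(1.0 - d_real).mean()
    loss_fake = F.relu(1.0 + d_fake).mean()
    return loss_real + loss_fake + recon_loss

def hinge_generator_loss(d_fake):
    """Generator loss J_G = -E[D(G(z))]."""
    return -d_fake.mean()
```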

The model is easy to train on modest hardware and little data, as evidenced by our own experiments on a set of about 30 chest CTs (about 2500 image slices, converted to RGB). Figure 25 shows randomly picked generated example slices, roughly arranged by anatomical content. It is to be noted that organs appear mirrored in some images. On the other hand, no color artifacts are visible, so the model has learned to produce only gray scale images. Training time for 50,000 iterations on an Nvidia TitanX GPU was approximately 10 hours.

Fig. 25
Fifteen computed tomography scans of the fast G A N arranged in three rows. Each row contains five different scans, with the abnormalities highlighted in light shade over a dark background.

FastGAN as implemented by the authors has been used to train a CT slice generative model. Images are not cherry-picked, but arranged by similar anatomical regions

4.5 VQGAN

In a recent development, a team of researchers combined techniques for text interpretation with a dictionary of elementary image elements feeding into a generative network. The basic architectural component that is employed goes back to vector quantization variational autoencoders (VQ-VAE), where the latent space is no longer allowed to be continuous but is quantized. This makes it possible to use the latent space vectors as entries of a look-up table: the visual elements.

Figure 26 was created using code available online, which demonstrates how images of different visual styles can be created using the combination of text-based conditioning and a powerful generative network.

Fig. 26
An image of the V Q GAN + CLIP combination. It has a drawing of grasses, flowers, and various animals.

The VQGAN+CLIP combination creates images from text inputs, here: “A child drawing of a dark garden full of animals”

The basis for image generation is the VQGAN (“vector quantization generative adversarial network”) [47], which learns representations of input images that can later steer the generative process, in an adversarial framework. The conditioning is achieved with the CLIP (“Contrastive Language-Image Pretraining”) model, which learns a discriminator that can judge plausible images for a text label or vice versa [48].

The architecture has been developed with an observation in mind that puts the benefits and drawbacks of convolutional and transformer architectures in relation to each other. While the locality bias of convolutional architectures is inappropriate if overall structural image relations should be considered, it is of great help in capturing textural details that can exist anywhere, like fur, hair, pavement, or grass, but where the exact representation of hair positions or pavement stones is irrelevant. On the other hand, image transformers are known to learn convolutional operators implicitly, posing a severe computational burden without a visible impact on the results. Therefore, Esser et al. [47] suggest to combine convolutional operators for local detail representation and transformer-based components for image structure.

Since the VQGAN as a whole is no longer a pure CNN but for a crucial component uses a transformer architecture, this model will be brought up again briefly in Subheading 5.2.

The VQGAN architecture is derived from the VQ-VAE (vector quantization variational autoencoder) [49], adding an adversarial loss through a discriminator, which turns it into a GAN. At the core of the architecture is the quantization of estimated codebook entries: among the entries in the codebook, the one closest to the query vector encoding an image patch is determined. The found codebook entry is then referred to by its index in the codebook. This quantization operation is non-differentiable, so for end-to-end training, gradients are simply copied through it during backpropagation.
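A minimal sketch of this nearest-entry quantization with the gradient-copying (straight-through) trick, using generic tensors rather than the actual VQGAN code; codebook-learning losses are omitted:

```python
import torch

def quantize(z_e, codebook):
    """Nearest-codebook-entry quantization with a straight-through gradient.
    z_e: (batch, dim) encoder outputs; codebook: (num_entries, dim)."""
    distances = torch.cdist(z_e, codebook)     # pairwise distances to all entries
    indices = distances.argmin(dim=1)          # closest codebook entry per vector
    z_q = codebook[indices]
    # copy gradients through the non-differentiable lookup
    z_q = z_e + (z_q - z_e).detach()
    return z_q, indices
```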

The transformer can then efficiently learn to predict codebook indices from those comprising the current version of the image, and the generative part of the architecture, the decoder, produces a new version of the image. Learning expressive codebook entries is enforced by a perceptual loss that punishes inaccurate local texture, etc. Through this, the authors can show that high compression levels can be achieved—a prerequisite to enable efficient, yet comprehensive, transformer training.

5 Other Generative Models

We have already seen how GANs were not the first approach to image generation but have prevailed for a time when they became computationally feasible and in consequence have been better understood and improved to accomplish tasks in image analysis and image generation with great success. In parallel with GANs, other fundamentally different generative modeling approaches have also been under continued development, most of which have precursors from the “before-GAN” era as well. To give a comprehensive outlook, we will sketch in this last section the state of the art of a selection of these approaches.Footnote 11

5.1 Diffusion and Score-Based Models

Diffusion models take a completely different approach to distribution estimation. GANs implicitly represent the target distribution by learning a surrogate distribution. Likelihood-based models like VAEs approximate the target distribution explicitly, not requiring the surrogate. In diffusion models, however, the gradient of the log probability density function is estimated instead of the distribution itself (which would be the intractable integral of the gradient). This quantity is known as the Stein score function, leading to the notion that diffusion models are one variant of score-based models [53].

The simple idea behind this class of models is to revert a sequential noising process. Consider some image. Then, perform a large number of steps. In each step, add a small amount of noise from a known distribution, e.g., the normal distribution. Do this until the result is indistinguishable from random noise.

The denoising process is then formulated as a latent variable model, where T − 1 latents successively progress from a noise image \( {\mathbf{x}}_T\sim \mathcal{N}\left({\mathbf{x}}_T;\mathbf{0},\mathbf{I}\right) \) to the reconstruction that we call x0 ∼ q(x0). The reconstructed image x0 is therefore obtained by a reverse process pθ(x0:T). Note that each step in this chain can be evaluated in closed form [54]. Several model implementations of this approach exist, one being the denoising diffusion probabilistic model (DDPM). Here, a deep neural network learns to perform one denoising step given the image obtained so far and a t ∈ {1, …, T}. Iterative application of the model to the result of the last iteration will eventually yield a generated image from noise input.
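As an illustration of the closed-form evaluation of the forward (noising) chain, the following sketch samples a noised image directly for an arbitrary step t, using the cumulative noise schedule common in the DDPM literature; the linear beta schedule and tensor shapes are made-up example values, not this chapter's notation:

```python
import torch

def diffuse(x0, t, alphas_cumprod):
    """Sample x_t from q(x_t | x_0) in closed form:
    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise."""
    a_bar = alphas_cumprod[t]
    noise = torch.randn_like(x0)
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise, noise

# Illustrative linear schedule with T = 1000 steps:
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

x0 = torch.randn(1, 1, 64, 64)        # stand-in for a training image
x_t, eps = diffuse(x0, t=500, alphas_cumprod=alphas_cumprod)
```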

Autoregressive diffusion models (ARDMs) [55] follow yet another thought model, roughly reminiscent of PixelRNNs we have briefly mentioned above (see Subheading 3.2). Both share the approach to condition the prediction of the next pixel or pixels on the already predicted ones. Other than in the PixelRNN, however, the specific ARDM proposed by the authors does not rely on a predetermined schedule of pixel updates, so that these models can be categorized as latent variable models.

As of late, the general topic of score-based methods, among which diffusion models are one variant, received more attention in the research community, fueled by a growing body of publications that report image synthesis results that outperform GANs [53, 56, 57]. Score function-based and diffusion models superficially share the similar concept of sequentially adding/removing noise but achieve their objective with very different means: where score function-based approaches are trained by score-matching and their sampling process uses Langevin dynamics [58], diffusion models are trained using the evidence lower bound (ELBO) and sample with a decoder, which is commonly a neural network. Figure 27 visualizes an example for a score function.

Fig. 27
An illustration of the Stein score function. It has light shaded concentric horizontal and vertical ovals with many arrows in each direction pointing to the ovals, indicate the score function on a dark background.

The Stein score function can be conceived of as the gradient of the log probability density function, here indicated by two Gaussians. The arrows represent the score function

Score function-based (sometimes also score-matching) generative models have been developed to astounding quality levels, and the recent works of Yang Song and others provide accessible blog posts,Footnote 12 and a comprehensive treatment of the subject in several publications [53, 58, 59].

In the work of Ho et al. [54], the stepwise reverse (denoising) process is the basis of the denoising diffusion probabilistic models (DDPM). The authors emphasize that a proper selection of the noise schedule is crucial to fast, yet high-quality, results. They point out that their work is a combination of diffusion probabilistic models with score-matching models, in this combination also generalizing and including the ideas of autoregressive denoising models. In an extension of Ho et al.’s [54] work by Nichol and Dhariwal [57], an importance sampling scheme was introduced that lets the denoising process steer the most easy to predict next image elements. Equipped with this new addition, the authors can show that, in comparison to GANs, a wider region of the target distribution is covered by the generative model.

5.2 Transformer-Based Generative Models

The basics of how attention mechanisms and transformer architectures work will be covered in the subsequent chapter on this promising technology (Chapter 6). Attention-based models, predominantly transformers, have been used successfully for some time in sequential data processing and are now considered the superior alternative to recurrent networks like long-short-term memory (LSTM) networks. Transformers have, however, only recently made their way into the image analysis and now also the image generation world. In this section, we will only highlight some developments in the area of generative tasks.

Google Brain/Google AI’s 2018 publication on so-called image transformers [60], among other tasks, shows successful conditional image generation for low-resolution input images to achieve super-resolution output images, and for image inpainting, where missing or removed parts of input images are replaced by content produced by the image transformer.

OpenAI later showed that even unmodified language transformers can succeed in modeling image data, by substituting sheer compute power for the hand-modeled domain knowledge that was the basis for the great success of previous unsupervised image generation models. They have trained Image GPT (or iGPT for short), a multibillion-parameter language transformer model, and it excels in several image generation tasks, though only for fairly small image sizes [61].

In the recent past, StyleSwin has been proposed by Microsoft Research Asia [62], enabling high-resolution image generation. However, the approach uses a block-wise attention window, thereby potentially introducing spatial incoherencies at block edges, which they have to correct for.

“Taming transformers” [47], another recent publication already mentioned above, uses what the authors call a learned template code book of image components, which is combined with a vector quantization GAN (VQGAN). The VQGAN is structurally modeled after the VQ-VAE but adds a discriminator network. A transformer model in this architecture composes these code book elements and is interrogated by the GAN variational latent space, conditioned on a textual input, a label image, or other possible inputs. The GAN reconstructs the image from the so-quantized latent space using a combination of a perceptual loss assessing the overall image structure and a patch-based high-resolution reconstruction loss. By using a sliding attention window approach, the authors prevent patch border artifacts known from StyleSwin. Conditioning on textual input makes use of parts of the CLIP [48] idea (“Contrastive Language-Image Pretraining”), where a language model was trained in conjunction with an image encoder to learn embeddings of text-image pairs, sufficient to solve many image understanding tasks with competitive precision, without specific domain adaptation.

The lineup of institutions involved shows that training image transformer models successfully is not something that can be achieved with modest hardware or even a medium-scale image database. In particular for the medical domain, where data is comparatively scarce even under the best assumptions, the power of such models will only become available in the near future if domain transfer learning can be achieved successfully. This, however, is a known strength of transformer architectures.