Abstract
Purpose
In the rapidly expanding field of artificial intelligence (AI) there is a wealth of literature detailing the myriad applications of AI, particularly in the realm of deep learning. However, a review that elucidates the technical principles of deep learning as relevant to radiation oncology in an easily understandable manner is still notably lacking. This paper aims to fill this gap by providing a comprehensive guide to the principles of deep learning that is specifically tailored toward radiation oncology.
Methods
In light of the extensive variety of AI methodologies, this review selectively concentrates on the specific domain of deep learning. It emphasizes the principal categories of deep learning models and delineates the methodologies for training these models effectively.
Results
This review initially delineates the distinctions between AI and deep learning as well as between supervised and unsupervised learning. Subsequently, it elucidates the fundamental principles of major deep learning models, encompassing multilayer perceptrons (MLPs), convolutional neural networks (CNNs), recurrent neural networks (RNNs), transformers, generative adversarial networks (GANs), diffusion-based generative models, and reinforcement learning. For each category, it presents representative networks alongside their specific applications in radiation oncology. Moreover, the review outlines critical factors essential for training deep learning models, such as data preprocessing, loss functions, optimizers, and other pivotal training parameters including learning rate and batch size.
Conclusion
This review provides a comprehensive overview of deep learning principles tailored toward radiation oncology. It aims to enhance the understanding of AI-based research and software applications, thereby bridging the gap between complex technological concepts and clinical practice in radiation oncology.
Introduction
Radiation therapy stands as one of the cornerstones of the multidisciplinary management of cancer, harnessing targeted ionizing radiation to eradicate malignant cells while sparing surrounding healthy tissue. As with many other fields in medicine, the advent and rapid evolution of artificial intelligence (AI) technologies, in particular deep learning, promise to revolutionize the landscape of radiation oncology, thereby fostering improvements in treatment planning, accuracy, personalization, and patient outcomes [1, 2]. AI’s potential in radiation oncology is vast, from delineating tumors to optimizing radiation dosage, predicting responses, and monitoring potential side effects [3]. However, for modern-day radiooncologists and medical physics experts, navigating the complexities of AI principles, methodologies, and applications can be daunting. This review endeavors to bridge this knowledge gap, providing a comprehensive and accessible guide to the principles of AI as they pertain to radiation therapy. By understanding the intersection of AI with radiation oncology, practitioners can be better positioned to harness these technologies for improved patient care in the future.
In the rapidly expanding world of AI, there exists a wealth of literature detailing intricate deep learning techniques and their myriad applications [4, 5]. While these comprehensive resources offer in-depth perspectives, they are often tailored to audiences with a strong foundation in machine learning and computational sciences. Many such publications can be labyrinthine for clinicians and medical professionals, potentially hindering the effective assimilation of this knowledge into clinical practice. Recognizing this gap, the present review has been meticulously crafted with radiooncologists and medical physics experts in mind. Our aim is to demystify complex AI methodologies, thus offering a digestible and clinically relevant overview that allows for the seamless integration of AI into the realm of radiation therapy. This guide stands as a bridge between the sophisticated world of deep learning and the practical necessities of the radiooncology clinic.
General principles
Artificial intelligence, machine learning, deep learning, and radiomics
The relationships between AI, machine learning, deep learning, and radiomics are displayed in Fig. 1. AI includes everything from simple, rule-based algorithms to complex, learning-based systems. A program which can sense, reason, act, and adapt can be regarded as AI. Machine learning represents a subset of AI technologies. It refers to the ability of machines to learn from data. Deep learning represents a subset of machine learning techniques. It involves neural networks with many layers (hence “deep”) that can learn from data. Radiomics [6] is a process to extract computational imaging features of any kind from medical images with the purpose of predicting medical endpoints such as treatment effectiveness and potential side effects [7], which has become a well-known concept in radiation oncology. The conventional concept of radiomics refers to the specific classical workflow using predefined features such as PyRadiomics features [8] and the conventional classification and regression algorithms like support vector machines (SVMs) [9]. Since comparable imaging features can be extracted by convolutional neural networks (CNNs) via convolution [10], the broader concept of radiomics has nowadays extended from conventional machine learning into deep learning, as illustrated in Fig. 1.
General components of training a deep learning model
In deep learning, the training process is essential for developing models that can make accurate predictions or decisions based on data. This process involves several key components: epochs, batches, learning rate, optimizers, loss functions, overfitting, and validation, as well as separate datasets for training, validation, and testing. An epoch represents a full pass through the entire training dataset, where the model updates its weights. Training data are often divided into smaller subsets called batches, which are processed sequentially during an epoch. The learning rate determines the step size at each iteration while moving toward a minimum of the loss function, influencing how quickly a model learns. Optimizers [11] are algorithms or methods used to change the attributes of the neural network, such as weights and learning rate, to reduce the losses. A loss function measures how well the model’s predictions correspond to the actual values, and minimizing this loss is crucial for learning from the data. Overfitting occurs when a model learns not just the underlying pattern but also the noise in the training data, leading to poor performance on unseen data. Techniques like regularization are used to prevent this, and the model’s performance is continually assessed on a validation dataset to tune parameters and select the best model iteration. After training and validation, the model is evaluated on a test dataset, providing a final measure of the model’s performance in real-world scenarios. Effective management of these components enhances the model’s ability to learn accurately and predict reliably.
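The interplay of these components can be illustrated with a minimal sketch. It assumes a toy linear model trained with plain mini-batch gradient descent on synthetic data; the function names (`mse_loss`, `train`), the data, and all hyperparameter values are illustrative assumptions and are not taken from any of the cited works.

```python
import numpy as np

def mse_loss(w, X, y):
    """Loss function: mean squared error between predictions X @ w and targets y."""
    return float(np.mean((X @ w - y) ** 2))

def train(X_train, y_train, X_val, y_val,
          epochs=50, batch_size=16, learning_rate=0.05, seed=0):
    """Mini-batch gradient descent that keeps the weights with the best validation loss."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X_train.shape[1])
    best_w, best_val = w.copy(), mse_loss(w, X_val, y_val)
    for epoch in range(epochs):                      # one epoch = one full pass over the training data
        order = rng.permutation(len(X_train))        # reshuffle the samples every epoch
        for start in range(0, len(order), batch_size):
            idx = order[start:start + batch_size]    # one batch
            Xb, yb = X_train[idx], y_train[idx]
            grad = 2.0 * Xb.T @ (Xb @ w - yb) / len(idx)  # gradient of the batch loss
            w = w - learning_rate * grad             # step size set by the learning rate
        val = mse_loss(w, X_val, y_val)              # validation after each epoch
        if val < best_val:                           # select the best model iteration
            best_val, best_w = val, w.copy()
    return best_w, best_val

# Synthetic data (assumption): 3 features, known weights, small additive noise.
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=300)

# Separate datasets for training, validation, and testing.
X_train, y_train = X[:200], y[:200]
X_val, y_val = X[200:250], y[200:250]
X_test, y_test = X[250:], y[250:]

w_fit, val_loss = train(X_train, y_train, X_val, y_val)
test_loss = mse_loss(w_fit, X_test, y_test)          # final, held-out evaluation
```

The test set is touched only once, after training and model selection are finished, which is what makes it a fair estimate of performance on unseen data.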
Supervised vs. unsupervised learning
Supervised and unsupervised learning are two foundational paradigms in machine learning. The key distinction between supervised and unsupervised learning is whether labels are used for training the AI model. Supervised learning involves training models on labeled datasets, whereby both the input data and the corresponding desired outputs are provided. For example, to train a brain metastasis segmentation model, input medical images and their corresponding metastasis segmentation labels are necessary during the training phase [12]. In contrast, unsupervised learning does not require explicit labels for training. Instead, it delves into the inner features of the data, aiming to unveil hidden structures or patterns. Examples include clustering patients into groups based on similar tumor profiles to understand disease subtypes [13] or using dimensionality-reduction techniques to represent complex patient data in a more interpretable manner [14].
Fundamentally, the distinction between supervised and unsupervised learning relies on the loss/objective function: if the definition of the loss function requires a label, it is supervised learning; otherwise, it is unsupervised learning. For example, training a brain metastasis segmentation model [12] can use the following binary cross entropy (BCE) loss function:
\[\mathcal{L}_{\text{BCE}}=-\left[y\log(p)+(1-y)\log(1-p)\right],\tag{1}\]
where \(p\) is the network output probability of each pixel being a metastasis and \(y\) is its corresponding classification label (1 for metastasis and 0 for normal tissue). Hence, it is supervised learning. Instead, training a network to reduce noise in computed tomography (CT) images [15] with the following total variation (TV) loss is an example of unsupervised learning:
\[\mathcal{L}_{\text{TV}}=\sum_{i}||(\nabla\boldsymbol{f})_{i}||_{2},\tag{2}\]
where \(\boldsymbol{f}\) is the input image and \(i\) is the pixel index. The nabla operator \(\nabla\) is a gradient operator which calculates the pixel intensity differences between the \(i\)-th pixel and its neighboring pixels. The \(\ell_{2}\) norm \(||\cdot||_{2}\) represents the magnitude of the change. TV is thus the sum of the magnitudes of the local intensity changes over all pixels. TV exploits the intrinsic gradient sparsity of clean medical images [16] and, hence, no additional labels are required.
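The difference between the two loss types can be made concrete with a short sketch. It assumes that the BCE loss is averaged over all pixels and that the gradient in the TV loss is approximated by forward differences with a zero difference at the image border; both choices, as well as the function names, are assumptions made for illustration.

```python
import numpy as np

def bce_loss(p, y, eps=1e-7):
    """Binary cross entropy, averaged over pixels (supervised: needs the labels y)."""
    p = np.clip(p, eps, 1.0 - eps)                   # avoid log(0)
    return float(-np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))

def tv_loss(f):
    """Isotropic total variation of a 2D image (unsupervised: needs only the image f)."""
    # Forward differences; the last column/row is repeated so the border difference is 0.
    dx = np.diff(f, axis=1, append=f[:, -1:])
    dy = np.diff(f, axis=0, append=f[-1:, :])
    return float(np.sum(np.sqrt(dx ** 2 + dy ** 2)))  # sum of gradient magnitudes

# Supervised example: predicted probabilities and ground-truth labels for three pixels.
p = np.array([0.9, 0.2, 0.7])
y = np.array([1.0, 0.0, 1.0])
segmentation_loss = bce_loss(p, y)

# Unsupervised example: a 4x4 image with a single vertical edge.
step_image = np.zeros((4, 4))
step_image[:, 2:] = 1.0
smoothness_loss = tv_loss(step_image)                # one unit jump in each of the 4 rows
```

Note that `bce_loss` cannot be evaluated without `y`, whereas `tv_loss` takes the image alone, which mirrors the label-based distinction made above.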
Another example of unsupervised learning is using an autoencoder network [17, 18] to reconstruct its input. Autoencoder networks have many applications: they can be used for image denoising (e.g., improving CT/cone-beam CT or MRI in radiotherapy) [19], for abnormality detection (e.g., in radiotherapy plans) [20], and for deriving deep radiomics features from image data [21], among other things. The \(\mathcal{L}_{2}\) loss can be used for training the autoencoder:
\[\mathcal{L}_{2}=||\mathcal{M}(\boldsymbol{f})-\boldsymbol{f}||_{2}^{2},\tag{3}\]
where \(\mathcal{M}\) is the autoencoder model and \(\mathcal{M}(\boldsymbol{f})\) is the model output with the input image \(\boldsymbol{f}\). As the autoencoder aims to restore the original input image as accurately as possible, no additional labels are required.
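As a hedged illustration of label-free reconstruction, the sketch below uses a purely linear autoencoder without nonlinear activations, fitted in closed form by a singular value decomposition instead of by gradient-based training. This is a deliberate simplification of the deep autoencoders discussed here; the function names and the random toy data are assumptions.

```python
import numpy as np

def fit_linear_autoencoder(F, k):
    """Fit a linear autoencoder with a k-dimensional bottleneck.

    F has shape (n_samples, n_pixels). The best linear encoder/decoder pair
    under the L2 reconstruction loss is given by the top-k right singular vectors.
    """
    mean = F.mean(axis=0)
    _, _, Vt = np.linalg.svd(F - mean, full_matrices=False)
    basis = Vt[:k].T                                 # shape (n_pixels, k)
    return mean, basis

def reconstruct(F, mean, basis):
    latent = (F - mean) @ basis                      # encoder: compress to the bottleneck
    return latent @ basis.T + mean                   # decoder: restore the input

def l2_loss(F, F_hat):
    """Squared L2 reconstruction error, averaged over samples; no labels involved."""
    return float(np.mean(np.sum((F_hat - F) ** 2, axis=1)))

# Toy data (assumption): 50 flattened "images" with 16 pixels each.
rng = np.random.default_rng(0)
F = rng.normal(size=(50, 16))

losses = {}
for k in (2, 8, 16):
    mean, basis = fit_linear_autoencoder(F, k)
    losses[k] = l2_loss(F, reconstruct(F, mean, basis))
```

The reconstruction error shrinks as the bottleneck widens and vanishes (up to rounding) once the bottleneck is as wide as the input, which is why a narrow bottleneck is needed to force the model to learn a compact representation.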
Model explainability
Deep learning models, particularly those involving complex architectures like deep neural networks, are often regarded as “black boxes” due to their intricate structures and the opaqueness of their decision-making processes. To address this, many methods have been developed to enhance their explainability. Feature importance methods such as Shapley Additive exPlanations (SHAP) [22] and the feature importance ranking measure (FIRM) [23] determine the importance of each feature in the decision-making process. Feature visualization methods like saliency maps [24] and gradient-weighted class activation mapping (Grad-CAM) [25] provide a heat map in the input image to highlight the input regions which are most relevant to the model’s decision. Transformer-based networks [26] with multihead self-attention mechanisms reveal the important visual or textual regions intrinsically by the attention weights. For further details on such model explainability methods, please refer to the survey papers [27, 28].
The above methods try to interpret models with complex architectures. Another direction is to build networks based on known mathematical or physical operators. Such networks are known as physics-informed neural networks (PINNs) [29, 30]. Since such networks use data-driven methods to improve the precision of conventional methods, the concept is also called “precision learning” [31]. For example, networks can be built based on the mathematical reaction–diffusion equations for glioma growth modeling [32, 33].
Types of deep learning models
In the field of radiation oncology, recent advancements have been significantly driven by deep learning algorithms, which represent the forefront of modern AI applications. Consequently, the following section introduces various types of deep learning models.
Multilayer perceptron (MLP)
Before the advent of convolutional neural networks (CNNs) [34], artificial neural networks mainly referred to multilayer perceptrons (MLPs), which are also called fully connected neural networks. The basic component of an MLP is called a perceptron [35], which resembles a biological neuron (Fig. 3a) in human or animal nervous systems. A perceptron applies a nonlinear activation (resembling the nonlinear all-or-none principle of a biological neuron, in which the magnitude of the action potential is independent of the magnitude of the input stimulus as long as the threshold is reached) to the weighted sum or combination of input feature values (corresponding to the dendrites integrating the synaptic input in a biological neuron). Therefore, mathematically, a perceptron (“artificial neuron”) simply corresponds to a linear regression followed by a nonlinear activation, as displayed in Fig. 3b. In radiation oncology, linear regression is commonly used:
\[\bar{y}=\boldsymbol{w}\cdot\boldsymbol{x}=w_{0}+w_{1}x_{1}+w_{2}x_{2}+{\ldots}+w_{k}x_{k},\tag{4}\]
where \(\bar{y}\) is the linear regression output, \(\boldsymbol{x}^{\top}=[1,x_{1},x_{2},{\ldots},x_{k}]\) is the feature vector (superscript \(\top\) denotes the transpose of \(\boldsymbol{x}\)), \(\boldsymbol{w}=[w_{0},w_{1},w_{2},{\ldots},w_{k}]\) is the weight vector, and \(\boldsymbol{w}\cdot\boldsymbol{x}\) is the vector dot product of both vectors. By convention and throughout this review, bold lower-case symbols (e.g., \(\boldsymbol{w}\)) refer to vectors, bold upper-case symbols (e.g., \(\boldsymbol{W}\)) refer to matrices, and non-bold italic symbols refer to scalar values. Note that the bias \(w_{0}\) has already been merged into the weight vector \(\boldsymbol{w}\). A perceptron is hence mathematically defined as
\[y=\delta(\boldsymbol{w}\cdot\boldsymbol{x}),\tag{5}\]
where \(\delta\) is a nonlinear activation function. Here, only one output value \(y\) is generated. To generate multiple output values as a vector \(\boldsymbol{y}\), different weight vectors can be stacked row by row to form a weight matrix \(\boldsymbol{W}\), i.e.,
\[\boldsymbol{y}=\delta(\boldsymbol{W}\boldsymbol{x}).\tag{6}\]
In a perceptron, various types of nonlinear activation functions can be used. These functions are crucial because they introduce nonlinearity into the model, enabling the network to learn more complex relationships in the data. Some commonly used nonlinear functions in perceptrons include step function (defined as \(\delta(x)=( \textrm{sign}(x)+1)/2\)), sigmoid/logistic function (\(\delta(x)=\frac{1}{1+e^{-x}}\)) [36], hyperbolic tangent function (tanh; \(\delta(x)=\frac{e^{x}-e^{-x}}{e^{x}+e^{-x}}\)) [37], rectified linear unit (ReLU; \(\delta(x)=\max(0,x)\)) [38], leaky ReLU (\(\delta(x)=\max(x,\alpha x)\), where \(\alpha\) is a small positive number such as 0.01) [39], and the swish function (\(\delta(x)=x\cdot\frac{1}{1+e^{-\beta x}}\), where the scaling parameter \(\beta\) can be learned) [40]. Such functions are also widely employed in modern networks, with the ReLU function being the most frequently used due to its piecewise linearity [41], which has advantages for network optimization and interpretability.
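The activation functions listed above, together with a single perceptron, can be sketched in a few lines. The function names and the example weights are illustrative assumptions; the swish parameter `beta` is fixed here rather than learned.

```python
import numpy as np

def step(x):
    return (np.sign(x) + 1.0) / 2.0                  # 0 for x < 0, 0.5 at x = 0, 1 for x > 0

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    return np.maximum(x, alpha * x)                  # small slope alpha for negative inputs

def swish(x, beta=1.0):
    return x * sigmoid(beta * x)                     # beta is learnable in practice, fixed here

def perceptron(x, w, activation):
    """Nonlinear activation applied to the weighted sum; w[0] is the bias."""
    x_aug = np.concatenate(([1.0], x))               # prepend 1 so the bias is part of w
    return activation(np.dot(w, x_aug))

# Example (assumed values): two input features and three weights (bias first).
features = np.array([2.0, 3.0])
weights = np.array([-1.0, 0.5, 0.5])
output = perceptron(features, weights, relu)         # relu(-1 + 1 + 1.5)
```

Swapping `relu` for any of the other functions changes only the final nonlinearity; the weighted sum is identical in all cases.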
An MLP consists of an input layer, an output layer, and one or more hidden layers. Each hidden layer consists of multiple perceptrons. The universal approximation theorem [42] states that, in theory, an MLP with a single hidden layer can approximate any continuous function to arbitrary accuracy, provided that the hidden layer is allowed to be arbitrarily wide; this is an important theoretical foundation of neural networks. In practice, however, an arbitrarily wide hidden layer is infeasible. Hence, increasing the number of layers (i.e., the network depth) is a practical way to increase network representation power [43], which has empirically proven to be extremely effective. Increasing the number of layers enabled new levels of network performance and novel applications that were previously unachievable with shallower network architectures. Because of this, the term “deep” in “deep learning” mainly refers to the network depth. One intuitive explanation for the increased performance with increasing network depth is that increasing the number of layers allows the network to learn features at different levels of abstraction.
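A forward pass through a small MLP can be sketched as follows. The layer sizes, the random initialization, the use of ReLU in the hidden layers, and the linear output layer are all assumptions chosen for illustration.

```python
import numpy as np

def init_mlp(layer_sizes, seed=0):
    """One weight matrix per layer; the extra column holds the bias of each perceptron."""
    rng = np.random.default_rng(seed)
    return [rng.normal(scale=0.5, size=(n_out, n_in + 1))
            for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

def mlp_forward(x, weights):
    """Pass x through all layers: ReLU in hidden layers, linear output layer."""
    a = np.asarray(x, dtype=float)
    for i, W in enumerate(weights):
        z = W @ np.concatenate(([1.0], a))           # weighted sums of all perceptrons in the layer
        is_output_layer = (i == len(weights) - 1)
        a = z if is_output_layer else np.maximum(0.0, z)
    return a

# Assumed architecture: 4 inputs, two hidden layers with 8 perceptrons each, 2 outputs.
weights = init_mlp([4, 8, 8, 2])
prediction = mlp_forward([0.2, -1.0, 0.5, 3.0], weights)
n_parameters = sum(W.size for W in weights)          # 8*5 + 8*9 + 2*9
```

Adding another hidden layer only requires another entry in the list of layer sizes, which is how depth is increased in practice.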
Convolutional neural networks (CNNs)
General principles of CNNs
CNNs have become a cornerstone of deep learning [34], especially when dealing with image data. By analogy with MLPs, we call the basic component of a CNN a convolutional perceptron. Like a regular perceptron, it applies a nonlinear activation function to the weighted combination of input feature values. The difference is that a convolutional perceptron sums up only a neighborhood of input values, which is determined by the convolution kernel size or window (Fig. 4b), whereas a perceptron in an MLP sums up all the input values as a fully connected node (Fig. 4a). Fundamentally, a convolutional perceptron is the same as a regular perceptron, but with the weights set to 0 for input values outside the convolution window. Another key difference is that the same convolutional perceptron with the same weights is used to generate the next output value by simply moving the convolution kernel position within the image, as displayed in Fig. 4c, whereas a new perceptron with different weights is necessary for an MLP. In the example of Fig. 4, generating one output feature from a \(5\times 5\) image requires training 25 weights and, hence, generating a \(5\times 5\) feature map requires \(25\times 25\) weights for an MLP. In stark contrast, for a CNN with a \(3\times 3\) kernel, only 9 weights must be trained to generate a \(5\times 5\) feature map (Fig. 4d). The difference becomes drastic for medical images with a size of \(512\times 512\), where an MLP requires \(512^{4}\) weights to generate a \(512\times 512\) feature map, but a CNN requires the same 9 weights to generate a \(512\times 512\) feature map. Therefore, CNNs are more efficient for image processing because they reduce the number of parameters (from \(512^{4}\) to 9 for one feature map in this example) through weight sharing. For network training, too many parameters in a model relative to the quantity of available training data will cause overfitting. Because of the lower number of parameters, CNNs are less prone to overfitting than MLPs.
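The weight-sharing argument can be checked with a minimal sketch. It assumes a single \(3\times 3\) kernel with zero padding, no bias, and no nonlinear activation; as in most deep learning frameworks, the kernel is applied without flipping (cross-correlation). The function names are illustrative.

```python
import numpy as np

def conv2d_same(image, kernel):
    """Slide one shared kernel over a zero-padded image; output has the input size."""
    kh, kw = kernel.shape
    padded = np.pad(image, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros_like(image, dtype=float)
    for r in range(image.shape[0]):
        for c in range(image.shape[1]):
            window = padded[r:r + kh, c:c + kw]      # local neighborhood of pixel (r, c)
            out[r, c] = np.sum(window * kernel)      # same weights at every position
    return out

def weight_counts(height, width, kernel_size=3):
    """Weights needed for one feature map: fully connected layer vs. one shared kernel."""
    n_pixels = height * width
    return n_pixels * n_pixels, kernel_size * kernel_size

image = np.arange(25.0).reshape(5, 5)
mean_kernel = np.full((3, 3), 1.0 / 9.0)             # assumed example kernel (local averaging)
feature_map = conv2d_same(image, mean_kernel)

mlp_small, cnn_small = weight_counts(5, 5)           # 5x5 example of Fig. 4
mlp_large, cnn_large = weight_counts(512, 512)       # typical medical image size
```

The number of kernel weights is independent of the image size, whereas the fully connected count grows with the square of the number of pixels.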
Characterized by their unique convolutional layers, CNNs are adept at automatically extracting hierarchical features from images. Since one convolution kernel can only extract limited features, a convolutional layer typically uses multiple kernels to generate multiple feature maps, as displayed in Fig. 4e, which form different feature channels for subsequent convolutions. A deep CNN typically consists of tens of convolutional layers, which initially extract low-level features such as edges and progressively discern highly abstract features in the deeper layers [45,46,47]. Fig. 2 in [45] (https://arxiv.org/pdf/1311.2901.pdf) provides a good visual illustration of extracted features in different layers of a CNN for computer vision tasks. Fig. 5 in the current work exemplifies the feature maps of different layers in the DeepMedic [44] network for glioma segmentation. Such automatic spatial feature extraction is particularly valuable for tasks in radiooncology imaging like image enhancement [48,49,50], organ segmentation [51,52,53], and others [1, 54].
Representative networks of CNN
Convolutional autoencoders Autoencoders [17, 18] are commonly used to compress high-dimensional data into a low-dimensional latent representation. An autoencoder consists of an encoder, a bottleneck, and a decoder. A typical convolutional autoencoder is displayed in Fig. 6a. The encoder extracts the features of the input data via a sequence of convolution, nonlinear activation, and pooling (downsampling) operations. The bottleneck is a low-dimensional latent representation of the input data, e.g., a \(1\,\times\,1\,\times\,512\) vector. The decoder applies another sequence of convolution, nonlinear activation, and upsampling operations to restore the input data from the latent representation.
In Fig. 6a, each blue arrow has a \(3\times 3\) zero-padded convolution operation, a ReLU operation, and a batch normalization operation. Batch normalization is a technique used in neural networks that normalizes the inputs of each layer using the mean and variance of the values in the current batch, thereby improving training speed and stability by reducing the internal covariate shift [55]. The convolution operations extract features at different levels. The ReLU operations generate nonlinear responses. Batch normalization allows neural networks to use higher learning rates and be less sensitive to initialization. The red arrow stands for the max-pooling operation. Max pooling downsamples feature maps by using the maximum value to represent a \(2\times 2\) neighborhood. Because of the downsampling operations, the receptive field of \(3\times 3\) convolutions is enlarged and large-scale features can be extracted. The green arrow represents an upsampling operation. In particular, a bilinear upsampling with a factor of 2 followed by a \(2\times 2\) convolution is used.
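Two of these building blocks, max pooling and batch normalization, can be sketched as follows. The sketch is simplified on purpose: the batch normalization omits the learnable scale and shift parameters and the running statistics used at inference time, and it normalizes each feature over the batch dimension only; these simplifications and the function names are assumptions.

```python
import numpy as np

def max_pool_2x2(feature_map):
    """Downsample a 2D feature map by keeping the maximum of each 2x2 neighborhood."""
    h, w = feature_map.shape
    cropped = feature_map[:h - h % 2, :w - w % 2]    # drop an odd last row/column if present
    blocks = cropped.reshape(h // 2, 2, w // 2, 2)   # group pixels into 2x2 blocks
    return blocks.max(axis=(1, 3))

def batch_norm(batch, eps=1e-5):
    """Normalize each feature with the mean and variance of the current batch."""
    mean = batch.mean(axis=0)
    var = batch.var(axis=0)
    return (batch - mean) / np.sqrt(var + eps)       # eps avoids division by zero

feature_map = np.arange(16.0).reshape(4, 4)
pooled = max_pool_2x2(feature_map)                   # 4x4 -> 2x2

# Assumed toy batch: 32 samples with 4 features, deliberately not zero-centered.
rng = np.random.default_rng(0)
batch = rng.normal(loc=5.0, scale=2.0, size=(32, 4))
normalized = batch_norm(batch)
```

After pooling, a \(3\times 3\) convolution on the pooled map covers a \(6\times 6\) region of the original map, which is the enlargement of the receptive field described above.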
U‑Net U‑Net [56,57,58] is the most widely used network for biomedical image processing, including tumor segmentation [59, 60]. It is called “U”-Net because of its U‑shaped architecture design. A typical U‑Net architecture is displayed in Fig. 6b. The left part of the U‑Net is a contraction path, while the right part is an expansion path, both following the typical architectures of CNNs: a sequence of convolution, nonlinear activation, and pooling (i.e., down- or upsampling) operations. U‑Net can be regarded as a modified version of a convolutional autoencoder, where features from the encoder are copied and concatenated to the decoder at different feature levels. The copy operations are represented by the horizontal dashed arrows in Fig. 6b. The concatenation operations stack the features from the encoder and the decoder along the channel dimension (width \(\times\) height \(\times\) channel for 2D image processing). The copy and concatenation operations allow the U‑Net to employ multilevel features more effectively than the convolutional autoencoder, as fine-scale features lost in the bottleneck can be recovered gradually in the expansion path.
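The copy and concatenation step can be illustrated on array shapes alone. The sketch assumes a height \(\times\) width \(\times\) channel layout, arbitrary example channel counts, and nearest-neighbor upsampling in place of a learned or bilinear upsampling; it is not the implementation of any cited U‑Net.

```python
import numpy as np

def upsample_nearest_2x(features):
    """Double the spatial resolution by repeating every pixel (nearest neighbor)."""
    return features.repeat(2, axis=0).repeat(2, axis=1)

def skip_connect(encoder_features, decoder_features):
    """Upsample the decoder features and stack them with the copied encoder features."""
    upsampled = upsample_nearest_2x(decoder_features)
    return np.concatenate([encoder_features, upsampled], axis=-1)  # channel dimension

# Assumed shapes: fine-scale encoder features and coarser decoder features.
rng = np.random.default_rng(0)
encoder_features = rng.normal(size=(64, 64, 32))
decoder_features = rng.normal(size=(32, 32, 64))
merged = skip_connect(encoder_features, decoder_features)   # 32 + 64 channels at 64x64
```

The subsequent convolutions in the expansion path then see both the fine-scale encoder features and the upsampled coarse features at the same spatial positions.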
As U‑Nets are very powerful, many different U‑Net variants have been proposed which are characterized by different network topologies, parameters, and modifications that may have included over-optimization to specific datasets without broad generalizability. The nnU-Net or “no new U‑Net” [58] is a particularly important U‑Net-based segmentation pipeline for biomedical segmentation that includes rule-based parameter settings based on dataset characteristics, thus enabling out-of-the-box application to numerous segmentation problems and representing a reference for deep learning-based autosegmentation.
Others Many well-known CNNs are currently widely used in various applications. Others include LeNet (the first CNN) [34], AlexNet (the CNN that made deep learning popular) [61], visual geometry group (VGG) networks (commonly used for perceptual loss) [62], GoogLeNet [63], and ResNet (residual learning for better performance) [64].
Recurrent neural networks (RNNs)
General principles of RNNs
While CNNs shine in the domain of spatial data (such as 2D and 3D medical images), recurrent neural networks (RNNs) [65] bring their prowess to sequential data (like text, audio, or tumor respiratory motion information). The MLPs and CNNs introduced above are feed-forward networks, where the information moves in only one direction (from the input layer through multiple hidden layers to the output layer). In contrast, RNNs have network connections that form directed cycles. This structure allows RNNs to maintain a “memory” of previous inputs by incorporating their own output as part of the input for the subsequent step.
As displayed in Fig. 7, a regular feed-forward MLP network can be converted to an RNN by adding a directed loop. In the illustration \(\boldsymbol{x}\) is the input layer and \(\boldsymbol{y}\) is the output layer. The hidden layers in the feed-forward networks are compressed into one middle layer containing the memory state \(\boldsymbol{h}\) in the RNN. A loop is added to \(\boldsymbol{h}\) to memorize past time-dependent information. \(A\), \(B\), and \(C\) are corresponding network parameters. The unrolled RNN architecture in Fig. 8 illustrates how an RNN processes sequential data. The key part is that at time \(t\), the memory state \(\boldsymbol{h}_{t}\) is first updated according to the old internal state \(\boldsymbol{h}_{t-1}\) and the current input \(\boldsymbol{x}_{t}\) with the parameter \(C\), i.e., \(\boldsymbol{h}_{t}=f_{C}(\boldsymbol{h}_{t-1},\boldsymbol{x}_{t})\). With this relationship, prior temporal information is captured. With \(\boldsymbol{h}_{t}\) and \(\boldsymbol{x}_{t}\), the current output \(\boldsymbol{y}_{t}\) is predicted with the network parameters \(A\).
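In the following, a simple example is described for better understanding. Let us consider a simple MLP network with one hidden layer:
\[\boldsymbol{h}=\tanh(\boldsymbol{W}_{xh}\cdot\boldsymbol{x}),\quad\boldsymbol{y}=\sigma(\boldsymbol{W}_{hy}\cdot\boldsymbol{h}),\tag{7}\]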
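where \(\boldsymbol{W}_{xh}\) is the weight matrix between the input layer and the hidden layer, \(\boldsymbol{W}_{hy}\) is the weight matrix between the hidden layer and the output layer, \(\tanh\) is the specified nonlinear tanh activation function, and \(\sigma\) is the nonlinear sigmoid function. An RNN converted from the above MLP network can be represented as
\[\boldsymbol{h}_{t}=\tanh(\boldsymbol{W}_{xh}\cdot\boldsymbol{x}_{t}+\boldsymbol{W}_{hh}\cdot\boldsymbol{h}_{t-1}),\quad\boldsymbol{y}_{t}=\sigma(\boldsymbol{W}_{hy}\cdot\boldsymbol{h}_{t}),\tag{8}\]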
where \(\boldsymbol{x}_{t}\), \(\boldsymbol{y}_{t}\), and \(\boldsymbol{h}_{t}\) are time-dependent input, output, and memory state, respectively, with \(\boldsymbol{h}_{0}=\boldsymbol{0}\). \(\boldsymbol{W}_{hh}\) is the weight matrix between the old memory state \(\boldsymbol{h}_{t-1}\) and the current memory state \(\boldsymbol{h}_{t}\), which is the added loop connection. Note that the network parameters \(\boldsymbol{W}_{hy}\), \(\boldsymbol{W}_{xh}\), and \(\boldsymbol{W}_{hh}\) do not change over time at the inference phase and they correspond to \(A\), \(B\), and \(C\) in Fig. 7, respectively.
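The recurrence described above can be sketched as a simple loop over time steps. The sketch assumes that the memory state is updated with a tanh activation of the summed input and recurrent contributions and that biases are omitted; the dimensions and random weights are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rnn_forward(xs, W_xh, W_hh, W_hy):
    """Process a sequence step by step; the same weights are reused at every time step."""
    h = np.zeros(W_hh.shape[0])                      # h_0 = 0
    outputs = []
    for x_t in xs:
        h = np.tanh(W_xh @ x_t + W_hh @ h)           # new memory from current input and old memory
        outputs.append(sigmoid(W_hy @ h))            # output predicted from the memory state
    return np.array(outputs), h

# Assumed sizes: 3 input features, 4 memory units, 2 outputs, 5 time steps.
rng = np.random.default_rng(0)
W_xh = rng.normal(scale=0.5, size=(4, 3))
W_hh = rng.normal(scale=0.5, size=(4, 4))
W_hy = rng.normal(scale=0.5, size=(2, 4))
sequence = rng.normal(size=(5, 3))

ys, h_final = rnn_forward(sequence, W_xh, W_hh, W_hy)
ys_reversed, _ = rnn_forward(sequence[::-1], W_xh, W_hh, W_hy)
```

Because the memory state carries information forward, the final output depends on the order of the inputs; presenting the same inputs in reverse order generally gives a different final output, which a feed-forward network applied to each time step separately could not reproduce.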
Representative networks of RNN
Long short-term memory (LSTM) networks To train RNNs, the output errors need to be propagated back to the input through the network layers as well as through time [66]. Backpropagation through time is the regular backpropagation algorithm applied to RNNs with a well-defined order. Because of the repeated application of the chain rule in computing gradients over time, small gradients shrink exponentially with the number of time steps, leading to the vanishing gradient problem [67]. As a result, the training is dominated by the gradients of the time steps closest to the output, whereas the influence of distant earlier inputs is hardly learned. In addition, think about the challenge of guessing the final word in the following text generation task: “I grew up in Germany…(other long sentences here)… I speak fluent \(\_\_\_\).” The latest part of the sentence indicates that the missing word is likely a language. However, to accurately determine which language it is, we must consider the earlier mention of Germany. This earlier context is crucial to making a precise prediction, which requires long-term dependency. To solve the above two problems, long short-term memory (LSTM) networks [68, 69] and gated recurrent units (GRUs) [70] were invented.
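LSTM [68, 69] introduces three gates to control the information flow, i.e., the forget gate \(\boldsymbol{f}_{t}\), the input gate \(\boldsymbol{i}_{t}\), and the output gate \(\boldsymbol{o}_{t}\). They are mathematically defined as
\[\begin{aligned}\boldsymbol{f}_{t}&=\sigma(\boldsymbol{W}_{xf}\cdot\boldsymbol{x}_{t}+\boldsymbol{W}_{hf}\cdot\boldsymbol{h}_{t-1}),\\ \boldsymbol{i}_{t}&=\sigma(\boldsymbol{W}_{xi}\cdot\boldsymbol{x}_{t}+\boldsymbol{W}_{hi}\cdot\boldsymbol{h}_{t-1}),\\ \boldsymbol{o}_{t}&=\sigma(\boldsymbol{W}_{xo}\cdot\boldsymbol{x}_{t}+\boldsymbol{W}_{ho}\cdot\boldsymbol{h}_{t-1}),\end{aligned}\tag{9}\]
where \(\sigma\) is the nonlinear sigmoid activation function, which maps the internal memory state \(\boldsymbol{h}_{t-1}\) and the current input \(\boldsymbol{x}_{t}\) into values between 0 and 1, and each gate has its own weight matrices for the input and for the previous memory state, analogous to \(\boldsymbol{W}_{xh}\) and \(\boldsymbol{W}_{hh}\) in Eq. (8). Hence, these gates can control the information flow between fully off (0) and fully on (1). Compared with the standard RNN definition in Eq. (8), the LSTM definition is characterized as follows:
\[\begin{aligned}\tilde{\boldsymbol{c}}_{t}&=\tanh(\boldsymbol{W}_{xc}\cdot\boldsymbol{x}_{t}+\boldsymbol{W}_{hc}\cdot\boldsymbol{h}_{t-1}),\\ \boldsymbol{c}_{t}&=\boldsymbol{f}_{t}\otimes\boldsymbol{c}_{t-1}+\boldsymbol{i}_{t}\otimes\tilde{\boldsymbol{c}}_{t},\\ \boldsymbol{h}_{t}&=\boldsymbol{o}_{t}\otimes\tanh(\boldsymbol{c}_{t}),\end{aligned}\tag{10}\]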
where \(\tilde{\boldsymbol{c}}_{t}\) is the candidate cell state, which is a standard RNN like Eq. (8). The previous cell state \(\boldsymbol{c}_{t-1}\) and the candidate cell state \(\tilde{\boldsymbol{c}}_{t}\) are combined with the forget gate \(\boldsymbol{f}_{t}\) and the input gate \(\boldsymbol{i}_{t}\) to form the current cell state \(\boldsymbol{c}_{t}\), where \(\otimes\) is the pointwise multiplication. The new hidden memory state \(\boldsymbol{h}_{t}\) is formed by applying the output gate \(\boldsymbol{o}_{t}\) to the activated cell state \(\boldsymbol{c}_{t}\). With \(\boldsymbol{h}_{t}\), the current output is obtained via \(\boldsymbol{y}_{t}=\sigma(\boldsymbol{W}_{hy}\cdot\boldsymbol{h}_{t})\).
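One LSTM time step, as described above, can be sketched as follows. The sketch assumes bias-free gates and a tanh activation of the cell state; the dictionary keys such as `W_xf` and `W_hf` are hypothetical names for the gate-specific weight matrices, and the sizes and random weights are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step with forget, input, and output gates (biases omitted)."""
    f = sigmoid(params["W_xf"] @ x_t + params["W_hf"] @ h_prev)   # forget gate
    i = sigmoid(params["W_xi"] @ x_t + params["W_hi"] @ h_prev)   # input gate
    o = sigmoid(params["W_xo"] @ x_t + params["W_ho"] @ h_prev)   # output gate
    c_tilde = np.tanh(params["W_xc"] @ x_t + params["W_hc"] @ h_prev)  # candidate cell state
    c = f * c_prev + i * c_tilde                     # keep part of the old state, add part of the new
    h = o * np.tanh(c)                               # gated, activated cell state
    return h, c

def init_params(n_in, n_hidden, seed=0):
    rng = np.random.default_rng(seed)
    params = {}
    for gate in ("f", "i", "o", "c"):
        params["W_x" + gate] = rng.normal(scale=0.5, size=(n_hidden, n_in))
        params["W_h" + gate] = rng.normal(scale=0.5, size=(n_hidden, n_hidden))
    return params

# Assumed sizes: 2 input features, 3 memory units, 6 time steps.
params = init_params(n_in=2, n_hidden=3)
sequence = np.random.default_rng(1).normal(size=(6, 2))
h, c = np.zeros(3), np.zeros(3)
for x_t in sequence:
    h, c = lstm_step(x_t, h, c, params)
```

When the forget gate is close to 1 and the input gate close to 0, the cell state is passed on almost unchanged, which is the mechanism that lets information and gradients survive over many time steps.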
LSTM networks mitigate the vanishing gradient problem through their cell state (\(\boldsymbol{c}_{t}\)) design, which allows gradients to flow largely unchanged along the cell state, thus stabilizing training over many time steps. Additionally, they overcome long-term dependency issues with the three gates that regulate information flow, enabling the network to retain or forget information selectively, thereby efficiently capturing long-term relationships in data. (Note that GRU [70] is a simpler alternative to LSTM using two gates only, as displayed in Fig. 9c.) LSTM networks have been successfully applied to various radiation oncology tasks [71], e.g., for predicting respiratory signals in thoracoabdominal tumor radiotherapy [72] and for predicting glioma growth [73, 74].
Transformers
General principles of transformers
Transformers have redefined the boundaries of what can be achieved by deep learning. Transformers can weigh the importance of different parts of the input data, regardless of the distance between them, thus rendering them beneficial in natural language processing (NLP) and other sequence-related tasks. Central to transformers is the mechanism of tokenization and the innovative use of self-attention mechanisms characterized by the variables query (\(\boldsymbol{Q}\)), key (\(\boldsymbol{K}\)), and value (\(\boldsymbol{V}\)) [26].
Tokenization is the first step in processing data using a transformer model. In this step, input data, typically text, is converted into tokens. Each token is a piece of the original data, like a word or part of a word, that can be individually processed. These tokens are then converted into numerical vectors using embedding techniques like Word2vec [75, 76]. Because text tokens can be embedded into numerical vectors, a complete input text can be represented by a matrix \(\boldsymbol{X}\), which is a stack of such token vectors. If the text has \(n\) tokens and each token is represented by a \(d\)-dimensional vector, \(\boldsymbol{X}\) would be an \(n\times d\) matrix.
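The construction of the input matrix \(\boldsymbol{X}\) can be sketched with a toy example. The sketch assumes simple whitespace tokenization and a randomly initialized embedding table; real systems use subword tokenizers and learned embeddings such as Word2vec, so both choices, as well as the example sentence, are illustrative assumptions.

```python
import numpy as np

def tokenize(text):
    """Split the text into lower-case word tokens (whitespace tokenization)."""
    return text.lower().split()

def embed(tokens, vocabulary, embedding_table):
    """Look up one d-dimensional vector per token and stack them into X (n x d)."""
    ids = [vocabulary[token] for token in tokens]    # token -> integer index
    return embedding_table[ids]

text = "The tumor volume shrinks after the treatment"
tokens = tokenize(text)                              # n = 7 tokens
vocabulary = {token: i for i, token in enumerate(sorted(set(tokens)))}

d = 8                                                # assumed embedding dimension
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocabulary), d))  # random stand-in for learned embeddings

X = embed(tokens, vocabulary, embedding_table)       # shape (n, d)
```

Identical tokens, such as the two occurrences of "the", map to identical rows of \(\boldsymbol{X}\); distinguishing them by their position requires additional information such as positional encodings.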
The core innovation of the transformer architecture is the self-attention mechanism, which allows the model to weigh the importance of different tokens in a sequence relative to each other. The mechanism computes this attention using three main variables: query (\(\boldsymbol{Q}\)), key (\(\boldsymbol{K}\)), and value (\(\boldsymbol{V}\)). These are derived from the input tokens and play crucial roles in shaping the output of the attention process. Query (\(\boldsymbol{Q}\)) is a set of vectors that is used to probe the sequence. It represents the current token in context and looks for relevant information across the sequence. Key (\(\boldsymbol{K}\)) is a set of vectors that corresponds to each token in the sequence, against which the queries are compared. The relationship between a query and all keys determines the weighting of significance for each token in the sequence. Once the model determines the tokens to focus on (as guided by the strength of the query–key relationships), the value vectors (\(\boldsymbol{V}\)) of those tokens are aggregated to form the output of the attention mechanism.
To create the \(\boldsymbol{Q}\), \(\boldsymbol{K}\), and \(\boldsymbol{V}\) matrices, three weight matrices need to be initialized: \(\boldsymbol{W}^{\boldsymbol{Q}}\), \(\boldsymbol{W}^{\boldsymbol{K}}\), and \(\boldsymbol{W}^{\boldsymbol{V}}\). These are trainable parameters of the model and are learned during the training process. The dimensions of these matrices depend on the desired dimensionality of the \(\boldsymbol{Q}\), \(\boldsymbol{K}\), and \(\boldsymbol{V}\) vectors (denoted as \(d_{k}\) for \(\boldsymbol{Q}\) and \(\boldsymbol{K}\) and as \(d_{v}\) for \(\boldsymbol{V}\)). Typically, \(\boldsymbol{W}^{\boldsymbol{Q}}\), \(\boldsymbol{W}^{\boldsymbol{K}}\), and \(\boldsymbol{W}^{\boldsymbol{V}}\) have dimensions of \(d\times d_{k}\), \(d\times d_{k}\), and \(d\times d_{v}\), respectively. With the input matrix \(\boldsymbol{X}\) and the trainable weight matrices, the \(\boldsymbol{Q}\), \(\boldsymbol{K}\), and \(\boldsymbol{V}\) matrices are computed by matrix multiplication, i.e., \(\boldsymbol{Q}=\boldsymbol{X}\boldsymbol{W}^{\boldsymbol{Q}}\), \(\boldsymbol{K}=\boldsymbol{X}\boldsymbol{W}^{\boldsymbol{K}}\), and \(\boldsymbol{V}=\boldsymbol{X}\boldsymbol{W}^{\boldsymbol{V}}\).
Given the computed \(\boldsymbol{Q}\), \(\boldsymbol{K}\), and \(\boldsymbol{V}\) matrices, the attention scores between all pairs of queries and keys are calculated, typically using a scaled dot-product attention module [26], defined as follows:
\[\mathrm{Attention}(\boldsymbol{Q},\boldsymbol{K},\boldsymbol{V})=\mathrm{softmax}\left(\frac{\boldsymbol{Q}\boldsymbol{K}^{\top}}{\sqrt{d_{k}}}\right)\boldsymbol{V},\tag{11}\]
where the division by \(\sqrt{d_{k}}\) rescales the dot products so that gradient computation remains stable during training. The softmax function, applied to each row of the score matrix, is defined as
\[\mathrm{softmax}(x_{i})=\frac{e^{x_{i}}}{\sum_{j}e^{x_{j}}}.\tag{12}\]
The softmax function normalizes all vector elements to the range [0, 1] so that they sum to 1, like probabilities, which helps to preserve the scale of the attention output. This scaled dot-product attention module is illustrated in Fig. 10b.
In Fig. 10b, an optional mask operation is added. This is necessary for some tasks like text generation, where the output sentence of an attention module is related to the previously generated words but not to words generated in the future. Therefore, a mask is necessary to avoid attention to unavailable context.
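The scaled dot-product attention with an optional mask can be sketched in a few lines of plain Python. This is a minimal illustration rather than a reference implementation: the function names, the `causal` flag, and the toy input `X` (reused as queries, keys, and values) are assumptions made for this example, and practical models would rely on tensor libraries such as PyTorch.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V, causal=False):
    """Q, K: lists of d_k-dimensional vectors; V: list of d_v-dimensional vectors."""
    d_k = len(K[0])
    outputs, weights = [], []
    for i, q in enumerate(Q):
        # Query-key similarity, rescaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        if causal:
            # Mask future positions (j > i) so that they receive zero weight.
            scores = [s if j <= i else float("-inf") for j, s in enumerate(scores)]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of the value vectors.
        outputs.append([sum(w_j * v[c] for w_j, v in zip(w, V))
                        for c in range(len(V[0]))])
    return outputs, weights

# Hypothetical toy sequence of three 2-dimensional tokens.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, weights = scaled_dot_product_attention(X, X, X, causal=True)
```

With the mask enabled, the first token can only attend to itself, so its output equals its own value vector; without the mask, every token attends to the whole sequence.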
To exploit different aspects of a sentence (e.g., different semantic connections or syntactic structures), multiple scaled dot-product attention modules are run in parallel and combined to form a multihead attention block, which is illustrated in Fig. 10c. Such multihead attention blocks have become essential components of modern transformer networks. Because of the attention mechanism, transformer networks are capable of global context awareness. This global context awareness, together with auxiliary techniques and increases in the quantity and quality of training data, has boosted the development of NLP models, leading to the success of language models including Bidirectional Encoder Representations from Transformers (BERT) [78] and Generative Pre-Trained Transformer-4 (GPT-4), which are promising for decision-making support in radiation oncology [79,80,81].
The potential of the transformer architecture extends beyond NLP to computer vision tasks and medical image analysis [82], thanks to the development of the vision transformer (ViT) architecture [83]. Transformers have been applied to enhance the analysis of medical images, such as MRI and CT scans [84,85,86,87]. These applications capitalize on the global context awareness of transformers to reveal subtle patterns indicative of pathological conditions or treatment response. Noteworthy applications in radiation oncology [86, 87] include the TransUNet work [86] in which transformers were employed to improve the accuracy of tumor segmentation in CT scans, and the Multi-transSP work in which transformers are used to extract multimodal information for survival prediction. Both applications showcase their potential in aiding radiation oncologists in treatment planning and assessment.
Compared to RNNs, transformers have advantages in parallelization and long-distance dependencies: (i) unlike RNNs, which process data sequentially, transformers process entire sequences simultaneously. This allows for significantly more efficient training because the computations can be parallelized, making transformers particularly well suited for modern computing hardware like GPUs. (ii) Transformers use self-attention mechanisms to weigh the importance of each part of the input data relative to each other part, regardless of their positions in the sequence. This allows them to capture long-range dependencies more effectively than RNNs, which can struggle with such dependencies due to vanishing and exploding gradient issues. Because of these factors, transformers have largely overshadowed RNNs in many advanced applications. Nevertheless, RNNs are not obsolete. Since transformers typically require a large amount of training data and are computationally expensive, RNNs with a simpler architecture continue to have their importance in resource-constrained environments. Moreover, transformers process sequential data in a pseudotemporal manner, whereas the inherent sequential nature of RNNs is more appropriate for tasks where the sequence order strictly matters. Therefore, some hybrid networks such as Transformer-XL [88] and fast autoregressive transformers [89] that integrate RNNs and transformers together have been proposed.
Representative networks of transformers
Original transformer model The typical transformer architecture, introduced by Vaswani et al. for NLP tasks like language translation [26], is structured as an encoder–decoder framework, as illustrated in Fig. 11.
The encoder has one input (e.g., the source language text in translation tasks) and one output to the decoder (keys and values for the decoder). The input layer consists of two parts: text embedding and positional encoding. The input text is first embedded into a vector representation using embedding techniques like Word2vec [75, 76], so that it can be processed by networks. Unlike RNNs, which process words one after another in a sequential manner, transformers can process multiple words simultaneously in parallel, which requires positional encoding to tell the position of each word. Like the embedding of words, a vector (constructed by sine and cosine values of different frequencies) [26] is used to represent the position of each word and this positional vector is directly added to the word vector as the input of the encoder. The main body of the encoder consists of \(N\) (e.g., \(N=6\)) copies of the same modules, where each module shares the same architecture but with different weights.
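The sinusoidal positional encoding can be illustrated as follows. The sketch assumes the formulation of [26], in which even dimensions use a sine and odd dimensions a cosine with wavelengths controlled by a base of 10000; the function name and the toy `word_vector` are hypothetical.

```python
import math

def positional_encoding(num_positions, d_model, base=10000.0):
    """Return one d_model-dimensional positional vector per position."""
    pe = []
    for pos in range(num_positions):
        row = []
        for i in range(d_model):
            # Dimensions 2k and 2k+1 share the same frequency.
            angle = pos / base ** (2 * (i // 2) / d_model)
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        pe.append(row)
    return pe

pe = positional_encoding(num_positions=4, d_model=8)
word_vector = [0.1] * 8  # hypothetical embedding of the word at position 2
# The positional vector is added element-wise to the word embedding.
encoder_input = [w + p for w, p in zip(word_vector, pe[2])]
```

Because the positional vector is simply added to the embedding, both must have the same dimensionality.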
The decoder has two inputs: one from the encoder and the other being the previous output of the decoder (e.g., the already generated target language text in translation tasks, which corresponds to an autoregressive property). Enabled by its self-attention, the encoder undertakes the processing of the input sequence, thereby producing a contextualized representation. This functionality allows the encoder to model interactions among different segments of the input sequence. The decoder then utilizes this representation to generate output predictions. The attention mechanisms (self-attention present in both the encoder and decoder as well as cross-attention connecting each encoder and decoder) play a crucial role in the model efficiently capturing long-range dependencies and contextual information, addressing issues like the vanishing gradient problem associated with traditional RNNs.
Vision transformer (ViT) ViT [83], as illustrated in Fig. 12, presents a broadly adopted modification of the transformer architecture for computer vision tasks. In this model, images are processed as sequences of patches (one complete image is cut into small patches that are arranged in a fixed order), which are then linearly embedded and passed through transformer encoder layers. The integration of the transformer approach into computer vision highlights its wide-ranging utility. Based on ViT, many other transformer networks are being developed for medical image processing, including TransUNet [86], Swin-UNet [90], and Swin-UNETR [91].
Generative adversarial networks (GANs)
General principles of GANs
Generative adversarial networks (GANs) have achieved huge success since they were introduced by Goodfellow et al. [92]. GANs are a special type of neural network consisting of two neural networks trained simultaneously: one generator \(G\) and one discriminator \(D\). The generator and the discriminator engage in a two-player minimax game. The generator tries to produce data that look as real as possible, while the discriminator tries to get better at distinguishing real data from fake data. The process continues iteratively until equilibrium is reached, where the generator produces data almost identical to real data and the discriminator cannot differentiate between them.
The vanilla GAN’s architecture [92] is displayed in Fig. 13. The input of the generator, \(z\), is a random signal with a prior distribution \(p(z)\), e.g., a Gaussian or uniform distribution. The output of the generator, \(x_{g}\), which has a distribution of \(p_{g}(x)\), should be close to a real sample \(x_{r}\) from a distribution \(p_{r}(x)\). The objective of the discriminator is to tell whether an input signal is a real sample or a generated fake one, while the generator is trained to confuse the discriminator as much as possible.
Representative networks of the GAN type
In addition to image generation from noise (Fig. 13), GANs are capable of image-to-image translation, i.e., producing an image given another image (conditional GANs) [93,94,95]. The most widely known GANs for image-to-image translation are Pix2pixGAN [93] and CycleGAN [94]. Pix2pixGAN requires paired images for image translation, whereas CycleGAN allows unpaired images. In Fig. 14, the architecture of Pix2pixGAN in the application of CT image processing [50] is illustrated: the generator \(G\), which is typically a U-Net [56], converts a corrupted CT image into an artifact-free image. The discriminator \(D\) learns to distinguish the output image from the target image conditioned on the given input image.
Due to the great power of GANs in image generation, they have been widely used in various medical applications [95], including the field of radiation oncology. For example, GANs are commonly applied for synthetic CT generation from CBCT [96,97,98] or MRI images [99, 100]. GANs can also be applied for image segmentation and registration in prostate cancer radiotherapy [101].
Diffusion-based generative models
General principles of diffusion-based generative models
Diverging from traditional generative models, diffusion-based generative approaches [102,103,104,105,106] model data generation as a stochastic process, akin to particles undergoing random diffusion in a fluid. By reversing this process, such models can generate new data instances by starting from a noise sample and gradually refining it through iterative steps. A general pipeline of diffusion-based generative models is displayed in Fig. 15, where stochastic noise is added to an image in the forward diffusion process multiple times (500 times in the example) to corrupt the data slowly into random noise. For image generation, a random noise sample is picked and a reversed diffusion process is performed to gradually denoise a noisy image into a clean image.
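The forward diffusion process follows a stochastic differential equation (SDE) in the following form [102]:
\[\textrm{d}\boldsymbol{x}=\boldsymbol{f}(\boldsymbol{x},t)\,\textrm{d}t+\boldsymbol{g}(t)\,\textrm{d}\boldsymbol{w},\tag{13}\]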
where \(\boldsymbol{f}(\boldsymbol{x},t)\) is the drift coefficient, \(\boldsymbol{g}(t)\) is the diffusion coefficient, and \(\boldsymbol{w}\) represents the standard Brownian motion (i.e., a random vector in magnitude and direction). While \(\boldsymbol{f}(\boldsymbol{x},t) \textrm{d}t\) describes the deterministic change of an image over time, \(\boldsymbol{g}(t) \textrm{d}\boldsymbol{w}\) describes the stochastic change of an image over time.
As the reverse of a diffusion process is also a diffusion process, the reverse-time SDE process has the following form [102, 107]:
\[\textrm{d}\boldsymbol{x}=\left[\boldsymbol{f}(\boldsymbol{x},t)-\boldsymbol{g}(t)^{2}\nabla_{\boldsymbol{x}}\log p_{t}(\boldsymbol{x})\right]\textrm{d}t+\boldsymbol{g}(t)\,\textrm{d}\bar{\boldsymbol{w}},\tag{14}\]
where \(\bar{\boldsymbol{w}}\) is the standard Brownian motion when time flows backwards, which is implemented in the same way as the forward Brownian motion, as a random process; \(p_{t}(\boldsymbol{x})\) is the probability density of the data at time \(t\); and \(\nabla_{\boldsymbol{x}}\log p_{t}(\boldsymbol{x})\), the gradient of the log-density with respect to \(\boldsymbol{x}\), is called the score function in score-based generative models [103, 108]. It tells a model in which direction to move to increase the probability of the data. Conventionally, a probability model needs to be normalized, i.e., \(p^{\prime}_{t}(\boldsymbol{x})=\frac{p_{t}(\boldsymbol{x})}{s}\), so that all probabilities sum up to 1, where \(s\) is the integral/sum of \(p_{t}(\boldsymbol{x})\). Score-based models do not need this normalization: since \(\log p^{\prime}_{t}(\boldsymbol{x})=\log p_{t}(\boldsymbol{x})-\log s\) and \(\log s\) is a constant that does not depend on \(\boldsymbol{x}\), the gradient operation removes it and \(\nabla_{\boldsymbol{x}}\log p^{\prime}_{t}(\boldsymbol{x})=\nabla_{\boldsymbol{x}}\log p_{t}(\boldsymbol{x})\). Therefore, the formulation of the score function provides great flexibility in choosing model architectures.
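The above reverse-time SDE shares the same marginals as the following ordinary differential equation (ODE), namely the probability flow ODE [102]:
\[\textrm{d}\boldsymbol{x}=\left[\boldsymbol{f}(\boldsymbol{x},t)-\frac{1}{2}\boldsymbol{g}(t)^{2}\nabla_{\boldsymbol{x}}\log p_{t}(\boldsymbol{x})\right]\textrm{d}t,\tag{15}\]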
where the stochastic Brownian motion is removed and, hence, the ODE is deterministic.
In diffusion-based generative models, a network can learn the score function \(\nabla_{x}\log p_{t}(\boldsymbol{x})\) to guide the reverse diffusion process. Typically, a noise-conditional or time-conditional U‑Net with attention mechanisms [103] is applied to learn the score function. With the learned score function, the reverse-time SDE and the probability flow ODE can both be solved by various numerical methods, such as numerical SDE solvers [102, 109], numerical ODE solvers [102, 105, 110, 111], annealed Langevin dynamics [103], and predictor–corrector methods [102].
The applications of diffusion models in radiation oncology are being explored [112]. They can be applied to multicontrast MRI image translation and MRI–CT translation [113], fast MRI reconstruction [114, 115], brain MRI image synthesis for training brain tumor segmentation models [116], and diffusion MRI denoising [117].
Representative networks of diffusion-based generative models
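Denoising diffusion probabilistic models (DDPMs) DDPMs [104, 118] represent the first well-known diffusion-based generative models. DDPMs are a special discrete form of SDE diffusion models (Eq. (13)) with the following specific forward SDE [102]:
\[\textrm{d}\boldsymbol{x}=-\frac{1}{2}\beta(t)\boldsymbol{x}\,\textrm{d}t+\sqrt{\beta(t)}\,\textrm{d}\boldsymbol{w}\tag{16}\]
and the corresponding reverse SDE form
\[\textrm{d}\boldsymbol{x}=\left[-\frac{1}{2}\beta(t)\boldsymbol{x}-\beta(t)\nabla_{\boldsymbol{x}}\log p_{t}(\boldsymbol{x})\right]\textrm{d}t+\sqrt{\beta(t)}\,\textrm{d}\bar{\boldsymbol{w}},\tag{17}\]
where \(\beta(t)\) is the continuous-time counterpart of the discrete noise schedule \(\beta_{t}\) used below.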
DDPMs [104, 118] are constructed by two Markov chains: a forward chain that perturbs data to noise and a reverse chain that converts noise back to data. A Markov process assumes that the next state of a process only depends on the present state and not on the past states. Given an initial clean image \(\boldsymbol{x}_{0}=\boldsymbol{x}\), typically, Gaussian perturbation is applied in the forward Markov chain with the following transition kernel:
\[q(\boldsymbol{x}_{t}|\boldsymbol{x}_{t-1})=\mathcal{N}\left(\boldsymbol{x}_{t};\sqrt{1-\beta_{t}}\boldsymbol{x}_{t-1},\beta_{t}\boldsymbol{I}\right),\tag{18}\]
where \(\mathcal{N}\) denotes the Gaussian distribution, \(\boldsymbol{I}\) is the identity matrix, and \(\beta_{t}\in(0,1)\) is a preset hyperparameter for training. It means that each pixel \((i,j)\) of the 2D image \(\boldsymbol{x}_{t}\) is drawn from a Gaussian distribution with the mean value of \(\sqrt{1-\beta_{t}}\boldsymbol{x}_{t-1}(i,j)\) and the variance of \(\beta_{t}\). Denoting \(\alpha_{t}:=1-\beta_{t}\) and \(\bar{\alpha}_{t}:=\Pi_{s=1}^{t}\alpha_{s}\), Eq. (18) leads to the following:
\[q(\boldsymbol{x}_{t}|\boldsymbol{x}_{0})=\mathcal{N}\left(\boldsymbol{x}_{t};\sqrt{\bar{\alpha}_{t}}\boldsymbol{x}_{0},(1-\bar{\alpha}_{t})\boldsymbol{I}\right).\tag{19}\]
Therefore, the relationship between \(\boldsymbol{x}_{t}\) and \(\boldsymbol{x}_{0}\) is
\[\boldsymbol{x}_{t}=\sqrt{\bar{\alpha}_{t}}\boldsymbol{x}_{0}+\sqrt{1-\bar{\alpha}_{t}}\boldsymbol{\epsilon},\tag{20}\]
where \(\boldsymbol{\epsilon}\) is a standard Gaussian noise \(\boldsymbol{\epsilon}\thicksim\mathcal{N}(0,\boldsymbol{I})\). When \(T\) is large enough (\(T\) is the total number of time steps in the forward process), we have \(\bar{\alpha}_{T}\approx 0\) and \(\boldsymbol{x}_{T}\) becomes pure Gaussian noise (which is the latent representation of the image).
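The closed-form relationship between \(\boldsymbol{x}_{t}\) and \(\boldsymbol{x}_{0}\) means that a noisy sample at any step can be drawn directly, without simulating all intermediate steps. The following plain-Python sketch illustrates this under stated assumptions: the linear schedule with \(\beta_{1}=10^{-4}\), \(\beta_{T}=0.02\), and \(T=1000\) is an illustrative choice, and all function names as well as the three-pixel "image" `x0` are hypothetical.

```python
import math
import random

def linear_beta_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced beta_1..beta_T (illustrative values, an assumption)."""
    return [beta_start + (beta_end - beta_start) * t / (T - 1) for t in range(T)]

def alpha_bars(betas):
    """Cumulative products of alpha_t = 1 - beta_t."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out

def q_sample(x0, t, a_bar, rng):
    """Draw x_t directly from x_0 (t is 1-based) using the closed form."""
    ab = a_bar[t - 1]
    return [math.sqrt(ab) * v + math.sqrt(1.0 - ab) * rng.gauss(0.0, 1.0)
            for v in x0]

rng = random.Random(0)
a_bar = alpha_bars(linear_beta_schedule(T=1000))
x0 = [0.5, -0.2, 0.9]                   # hypothetical clean pixel values
x_mid = q_sample(x0, 500, a_bar, rng)   # partially corrupted sample
x_T = q_sample(x0, 1000, a_bar, rng)    # close to pure Gaussian noise
```

For this schedule, \(\bar{\alpha}_{T}\) falls below \(10^{-4}\), so the contribution of \(\boldsymbol{x}_{0}\) to \(\boldsymbol{x}_{T}\) is negligible.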
For the reverse process, which is image generation, we need to learn the reverse Markov chain in the following form:
\[p_{\boldsymbol{\theta}}(\boldsymbol{x}_{t-1}|\boldsymbol{x}_{t})=\mathcal{N}\left(\boldsymbol{x}_{t-1};\boldsymbol{\mu}_{\boldsymbol{\theta}}(\boldsymbol{x}_{t},t),\boldsymbol{\Sigma}_{\boldsymbol{\theta}}(\boldsymbol{x}_{t},t)\right),\tag{21}\]
where \(\boldsymbol{\theta}\) is the model parameter set and \(\boldsymbol{\mu}_{\boldsymbol{\theta}}(\boldsymbol{x}_{t},t)\) and \(\boldsymbol{\Sigma}_{\boldsymbol{\theta}}(\boldsymbol{x}_{t},t)\) are the mean and covariance parameterized by deep neural networks, respectively. With such a learnable reverse transition kernel, an image can be generated from Gaussian noise.
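Denoising diffusion implicit models (DDIMs) DDIMs [105] extend the original DDPMs to non-Markovian processes with the following forward process:
\[q_{\sigma}(\boldsymbol{x}_{t}|\boldsymbol{x}_{t-1},\boldsymbol{x}_{0})=\frac{q_{\sigma}(\boldsymbol{x}_{t-1}|\boldsymbol{x}_{t},\boldsymbol{x}_{0})\,q_{\sigma}(\boldsymbol{x}_{t}|\boldsymbol{x}_{0})}{q_{\sigma}(\boldsymbol{x}_{t-1}|\boldsymbol{x}_{0})},\tag{22}\]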
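where \(\boldsymbol{x}_{t}\) depends not only on \(\boldsymbol{x}_{t-1}\) but also on \(\boldsymbol{x}_{0}\). Therefore, this is a non-Markovian process. The reverse process is tractable when conditioned on the initial image \(\boldsymbol{x}_{0}\):
\[q_{\sigma}(\boldsymbol{x}_{t-1}|\boldsymbol{x}_{t},\boldsymbol{x}_{0})=\mathcal{N}\left(\sqrt{\bar{\alpha}_{t-1}}\boldsymbol{x}_{0}+\sqrt{1-\bar{\alpha}_{t-1}-\sigma_{t}^{2}}\cdot\frac{\boldsymbol{x}_{t}-\sqrt{\bar{\alpha}_{t}}\boldsymbol{x}_{0}}{\sqrt{1-\bar{\alpha}_{t}}},\sigma_{t}^{2}\boldsymbol{I}\right).\tag{23}\]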
The above formulation covers DDPMs and DDIMs as special cases, where DDPMs correspond to setting \(\sigma_{t}^{2}=\frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_{t}}\beta_{t}\), whereas DDIMs correspond to setting \(\sigma_{t}^{2}=0\). Since \(\sigma_{t}^{2}=0\), no stochastic noise is added and the reverse process is fully deterministic for DDIMs. In addition, because the non-Markovian formulation leaves the training objective unchanged, a model can be trained with an arbitrary number of forward steps while only a subset of these steps needs to be evaluated in the reverse process for image generation. Therefore, DDIMs can accelerate image generation with similar image quality.
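A deterministic DDIM update with \(\sigma_{t}^{2}=0\) can be sketched as below. The sketch assumes the noise-prediction parameterization, in which a trained network returns an estimate `eps_pred` of the noise contained in \(\boldsymbol{x}_{t}\); here this estimate is simply passed in as an argument, and the function name and numerical values are hypothetical.

```python
import math

def ddim_step(x_t, eps_pred, a_bar_t, a_bar_prev):
    """One deterministic DDIM update (sigma_t = 0) from x_t to an earlier step."""
    # Estimate the clean image from x_t and the predicted noise.
    x0_pred = [(x - math.sqrt(1.0 - a_bar_t) * e) / math.sqrt(a_bar_t)
               for x, e in zip(x_t, eps_pred)]
    # Re-noise the estimate to the (possibly much) earlier noise level.
    return [math.sqrt(a_bar_prev) * x0 + math.sqrt(1.0 - a_bar_prev) * e
            for x0, e in zip(x0_pred, eps_pred)]

# Toy check with a perfect noise estimate and a large jump between noise levels.
x0 = [0.5, -0.2]
eps = [1.3, -0.7]
a_bar_t, a_bar_prev = 0.1, 0.6
x_t = [math.sqrt(a_bar_t) * a + math.sqrt(1.0 - a_bar_t) * e for a, e in zip(x0, eps)]
x_prev = ddim_step(x_t, eps, a_bar_t, a_bar_prev)
expected = [math.sqrt(a_bar_prev) * a + math.sqrt(1.0 - a_bar_prev) * e
            for a, e in zip(x0, eps)]
```

Because the two noise levels need not be adjacent, several forward steps can be skipped in a single update, which is the source of the acceleration.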
Conditional latent diffusion models Conditional latent diffusion models (LDMs) [119, 120] harness the power of a diffusion process within a latent space rather than directly in the high-dimensional pixel space to generate images conditionally based on various inputs like text or structured labels. In the latent space, the diffusion process can be mathematically modeled by a Markov chain of latent variables \(\{\boldsymbol{z}_{t}\}_{t=0}^{T}\), where \(\boldsymbol{z}_{0}\) is derived from the data distribution and \(\boldsymbol{z}_{T}\) is typically noise from a known distribution such as a Gaussian distribution. The transition from \(\boldsymbol{z}_{t-1}\) to \(\boldsymbol{z}_{t}\) is governed by a Gaussian transition probability, formally expressed as
\[q(\boldsymbol{z}_{t}|\boldsymbol{z}_{t-1})=\mathcal{N}\left(\boldsymbol{z}_{t};\sqrt{1-\beta_{t}}\boldsymbol{z}_{t-1},\beta_{t}\boldsymbol{I}\right),\tag{24}\]
where \(\beta_{t}\) is a variance schedule that guides the addition of noise over the diffusion step \(t\) and \(\boldsymbol{I}\) represents the identity matrix. This step-by-step noising process allows for an effective representation and manipulation of the data in its latent form, significantly optimizing computational efficiency.
To reverse this process for image generation, conditional LDMs employ a neural network parameterized as \(\boldsymbol{\theta}\) to model the reverse diffusion from noise to data. This is described by the conditional probability \(p_{\boldsymbol{\theta}}(\boldsymbol{z}_{t-1}|\boldsymbol{z}_{t},\boldsymbol{c})\), which is approximated by the network learning to denoise the data:
\[p_{\boldsymbol{\theta}}(\boldsymbol{z}_{t-1}|\boldsymbol{z}_{t},\boldsymbol{c})=\mathcal{N}\left(\boldsymbol{z}_{t-1};\boldsymbol{\mu}_{\boldsymbol{\theta}}(\boldsymbol{z}_{t},t,\boldsymbol{c}),\boldsymbol{\Sigma}_{\boldsymbol{\theta}}(\boldsymbol{z}_{t},t,\boldsymbol{c})\right).\tag{25}\]
In the above equation, \(\boldsymbol{\mu}_{\boldsymbol{\theta}}\) and \(\boldsymbol{\Sigma}_{\boldsymbol{\theta}}\) are functions learned during training and \(\boldsymbol{c}\) represents the conditioning variables such as text descriptions or other labels. Compared with Eq. (21), Eq. (25) works in the latent space \(\boldsymbol{z}\) instead of in the original data space \(\boldsymbol{x}\), with the additional condition \(\boldsymbol{c}\). This mathematical framework allows LDMs to perform tasks such as text-to-image synthesis, class-conditional image generation, and even complex video synthesis with remarkable fidelity [120].
Deep reinforcement learning
General principles of deep reinforcement learning
Reinforcement learning [121,122,123] is a general framework within which agents learn to perform actions in an environment to maximize a reward. Unlike supervised learning, reinforcement learning needs to learn with highly delayed supervised information (e.g., success or failure of the decision is available only after multiple time steps) and has to deal with sequential decisions [123]. The two main components of reinforcement learning are the environment, which represents the problem to be solved, and the agent, which represents the AI model. The agent has a policy \(\pi_{\boldsymbol{\theta}}(s,a)\) to determine which action \(a\) to take based on the current state \(s\). This action will interact with the environment, and the environment will provide feedback (reward) to the agent to adjust the agent’s behaviors based on the defined immediate reward function \(r\). The goal of reinforcement learning is to improve the policy \(\pi_{\boldsymbol{\theta}}(s,a)\) to maximize the accumulated rewards. Reinforcement learning can be separated into two general categories [122]: model-free and model-based algorithms. Model-free reinforcement learning algorithms do not create a parameterized model of the environment’s transition function to make predictions of future states and rewards. In real applications, such a parameterized model is typically not available. Therefore, model-free reinforcement learning algorithms are generally more useful in practice than model-based ones. Model-free algorithms can further be categorized into value-based [124, 125] and policy-based [126] methods, which determine the optimal reward value and policy, respectively.
For deep reinforcement learning, an agent is typically a deep neural network (DNN) with a parameter set \(\boldsymbol{\theta}\), which is used to estimate the optimal reward value or policy given the current state \(s\), as displayed in Fig. 16. A policy network directly maps the states of the environment to actions. The network’s output is a policy, which dictates the probability distribution of selecting each possible action from a given state, e.g., the policy networks in AlphaGo [127]. This is known as a stochastic policy, which can be beneficial in environments in which exploration is important. Policy networks can also output deterministic actions [128], where the policy directly specifies the action to be taken without any probabilistic behavior. A value network, on the other hand, typically estimates the reward value of taking a particular action in a given state [125]. The estimated function is called the action-value function or, as more commonly known, the Q‑function (denoted as \(Q(s,a)\)) [124], and such learning is well known as deep Q‑learning, which has been demonstrated to successfully achieve human-level control in playing Atari 2600 games directly from pixel data [125]. The optimal Q‑function value at a given state \(s\) and a given action \(a\) is further denoted as \(Q^{\ast}(s,a)\), which is mathematically defined as
where \(r_{t}\) is the immediate reward at time \(t\) and \(\gamma_{t+1}\) is a discount factor for the maximum future reward obtained by the next state \(s_{t+1}\) and action \(a_{t+1}\). The optimal \(Q^{\ast}\) functions have the following recurrence relation, i.e., the Bellman optimality equation [129]:
When we use a DNN to estimate the optimal \(Q^{\ast}\) function in deep Q‑learning, Eq. (27) allows us to start with a random \(Q\) function (i.e., random weights \(\boldsymbol{\theta}\) for the DNN) and update the network iteratively with the Bellman optimality equation as the loss. With sufficient iterations, \(Q\) converges to \(Q^{\ast}\) [130].
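The iterative update can be illustrated with a small tabular example, in which a lookup table takes the place of the DNN. This is a sketch under simplifying assumptions, not the method of any cited work: it uses the commonly stated form of the Bellman target with a constant discount factor, \(r+\gamma\max_{a^{\prime}}Q(s^{\prime},a^{\prime})\), and the five-state chain environment, the reward, and all hyperparameters are hypothetical.

```python
import random

N_STATES, ACTIONS, GAMMA, LR = 5, (-1, +1), 0.9, 0.5

def step(s, a):
    """Toy chain: reaching the right-most state ends the episode with reward 1."""
    s_next = min(max(s + a, 0), N_STATES - 1)
    done = s_next == N_STATES - 1
    return s_next, (1.0 if done else 0.0), done

rng = random.Random(0)
# Start from an arbitrary Q function (all zeros here).
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
for _ in range(500):
    s, done = 0, False
    while not done:
        a = rng.choice(ACTIONS)  # purely random exploration
        s_next, r, done = step(s, a)
        best_next = 0.0 if done else max(Q[(s_next, b)] for b in ACTIONS)
        target = r + GAMMA * best_next          # Bellman optimality target
        Q[(s, a)] += LR * (target - Q[(s, a)])  # move the estimate toward it
        s = s_next

# Greedy policy read out from the learned Q table.
greedy = [max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES - 1)]
```

In deep Q‑learning, the table update is replaced by a gradient step on the squared difference between the network output and the target.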
Deep reinforcement learning has been investigated in many medical imaging tasks such as landmark detection [131,132,133], image registration [134, 135], and view selection [136]. In the field of radiation oncology, deep reinforcement learning also has great potential in tasks like lesion detection [137], cancer classification [138], smart patient scheduling [139], automated radiation adaptation [140, 141], and general decision support in oncology [142].
Representative networks of deep reinforcement learning
Deep Q‑networks (DQNs) Deep Q‑networks (DQNs) are nonlinear approximators in deep Q‑learning for the optimal Q functions: \(Q(s,a;\boldsymbol{\theta})\approx Q^{\ast}(s,a)\), where \(\boldsymbol{\theta}\) is the network parameter set. All the networks introduced in the above subsections (3.1 to 3.6) can be used as DQNs for deep Q‑learning. DQNs incorporate two key design elements to enhance learning stability and efficiency: experience replay and target networks [125]. Experience replay involves storing agent experiences at each time step in a replay buffer and randomly sampling mini-batches from this buffer to update the network. This method breaks the correlation between consecutive learning samples, thus stabilizing training. Target networks, on the other hand, are clones of the main network that are held fixed for several updates, providing consistent targets during temporal difference learning. This separation reduces the volatility of updates, thereby further stabilizing the learning process. Together, these mechanisms enable DQNs to learn effective policies in complex and dynamic environments.
An exemplary application of DQNs is anatomical landmark detection [131], which is beneficial for organ at risk (OAR) autocontouring in radiation therapy. Here, a DQN (which is a CNN) is used to locate the target organ region, i.e., the cuboid image subvolume to be subsequently processed by the organ-specific autosegmentation model. In this application, the state \(s\) represents the location of the center of the current volume of interest (VOI), the action \(a\) is to move the VOI to its neighboring position (moving up, down, left, right, front, and back), and \(r\) is a scalar reward function which calculates the distance between the current VOI location and the target organ center. The trained model has a policy \(\pi\) to tell the VOI which direction to move in at a given location, which forms a searching trajectory from the start location to the target organ.
Actor–critic methods Actor–critic methods [143, 144] represent a powerful class of algorithms in deep reinforcement learning that synergistically combine the strengths of policy- and value-based approaches. These methods utilize two main components: the actor, which is responsible for learning the policy function, and the critic, which evaluates the policy by learning the value function. The actor updates the policy distribution in the direction suggested by the critic, aiming to maximize expected rewards, whereas the critic assesses the actor’s actions by computing a value function. This dual structure allows actor–critic methods to be more stable and converge faster than policy-gradient methods alone, as the critic’s evaluation stabilizes the updates provided to the actor. Modern variations, such as the asynchronous advantage actor–critic (A3C) [145], further refine this approach by decoupling the policy and value updates, thereby enabling more efficient learning processes and robust policy formulation in complex environments. This framework effectively addresses the high variance issue of policy-gradient methods while maintaining a continuous learning update, making it particularly suited for problems with high-dimensional action spaces and environments with stochastic dynamics.
Training deep learning models
Deep learning frameworks
For the development of deep learning models, several frameworks have gained prominence. The most widely used frameworks among academic researchers and industrial developers are PyTorch and TensorFlow, the latter of which includes the high-level API Keras. These were primarily built for the Python programming language. Frameworks developed in other programming languages include MATLAB’s Deep Learning Toolbox, Java’s Deeplearning4j, and Caffe for C/C++. Specifically for medical applications, the Medical Open Network for Artificial Intelligence (MONAI), built on top of PyTorch, offers an extensive array of networks for various applications. These networks are not only ready to use but also easily modifiable to suit specific requirements.
Data preprocessing
Rescaling the intensity of imaging data into a fixed range such as \([0,1]\) or \([-1,1]\) (min-max normalization), or to zero mean and unit variance (standardization), is a common data preprocessing step for all deep learning algorithms. It provides several advantages for both training and testing:
- Avoids saturation: many activation functions like sigmoid or tanh get saturated for very small or very large input values. In the saturated regions, gradients are near zero, leading to the vanishing gradient problem, which slows down or even halts training. By normalizing inputs to a range where the activation functions operate in their sensitive region, we can ensure that gradients are substantial enough for training to progress.
- Feature uniformity: normalizing ensures that all input features are on the same scale. Without normalization, features with larger scales might dominate the training process, making the model less sensitive to features with smaller scales. This is especially important for multimodality imaging data, since CT values range within a large scale of \([-1000,3000]\) HU, while MRI values are not universally standardized like CT values.
- Improved generalization: by ensuring that the model is trained on a standardized scale of input data, it is more likely that the model will generalize well to new, unseen data, provided the unseen data are also normalized in the same manner.
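Intensity rescaling can be sketched as follows. The example clips CT values to the \([-1000,3000]\) HU window mentioned above before rescaling; the z-score variant is shown as one common option for MRI, whose intensities have no fixed unit. The function names and the sample values are hypothetical.

```python
def minmax_normalize(values, lo, hi, out_range=(0.0, 1.0)):
    """Clip to [lo, hi], then rescale linearly to out_range."""
    a, b = out_range
    return [a + (b - a) * (min(max(v, lo), hi) - lo) / (hi - lo) for v in values]

def zscore(values):
    """Rescale to zero mean and unit variance."""
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mean) / std for v in values]

ct_hu = [-1000.0, 0.0, 1000.0, 3000.0, 4000.0]  # last value lies outside the window
ct_01 = minmax_normalize(ct_hu, -1000.0, 3000.0)
ct_pm1 = minmax_normalize(ct_hu, -1000.0, 3000.0, out_range=(-1.0, 1.0))
mri = zscore([120.0, 340.0, 560.0, 780.0])      # hypothetical MRI intensities
```

Whatever scheme is chosen, the same window or statistics convention has to be applied at inference time as during training.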
For MRI, bias correction, such as using the N4ITK method [146] to address bias field inhomogeneity, is an essential preprocessing step. It ensures consistent image quality and facilitates better feature extraction by the neural network. For example, in brain metastasis segmentation [12], false-positive segmentations might arise because of the hyperintensity in inhomogeneous regions. In addition to intensity normalization, the spatial normalization achieved by image registration is another important preprocessing step. Aligning multiple MRI scans or sequences (either intra- or interpatient) to a common spatial template, e.g., the Montreal Neurological Institute (MNI) space for brain imaging [147], is crucial for multimodal analyses and longitudinal studies. In addition, for brain imaging, skull stripping [148] is a preferred step to remove non-brain structures, which focuses the model’s attention on relevant brain tissues and helps to ensure data privacy.
Loss functions
Mean squared error (MSE; i.e., the L2 loss) and mean absolute error (MAE; i.e., the L1 loss) are commonly employed when aiming to regress continuous values, such as predicting the radiation dose–response behavior of tissues. Huber loss offers a compromise between MSE and MAE, proving beneficial in scenarios with occasional outliers. Perceptual loss using VGG network features [149] is also commonly used for medical imaging to enhance high-level details.
For segmentation tasks, where delineating tumors and organs at risk is paramount, Dice loss or Jaccard/intersection over union (IoU) loss are often preferred in addition to cross-entropy loss. When categorizing patient responses or outcomes, cross-entropy loss is a go-to choice, especially for multiclass classification. A comprehensive survey of loss functions for medical image segmentation is available in [150].
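A soft Dice loss for a binary segmentation mask can be written in a few lines. This is a minimal sketch on flattened lists; the smoothing constant `eps`, which avoids division by zero for empty masks, and the toy masks are assumptions for illustration.

```python
def soft_dice_loss(pred, target, eps=1e-6):
    """1 - Dice coefficient for flattened probabilities and binary labels."""
    intersection = sum(p * t for p, t in zip(pred, target))
    denominator = sum(pred) + sum(target)
    return 1.0 - (2.0 * intersection + eps) / (denominator + eps)

target = [0, 0, 1, 1, 0, 0, 0, 0]  # two foreground voxels out of eight
perfect = soft_dice_loss([0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0], target)
half = soft_dice_loss([0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0], target)
```

Because the loss depends only on the overlap relative to the foreground sizes, the large number of correctly classified background voxels does not dilute it, which is why it is popular for small targets.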
Survival prognosis is a common task in radiation oncology. Cox proportional hazards models [151] are a class of effective survival models that typically use the negative partial likelihood loss function [152]. The partial likelihood has one factor per observed event time, i.e., only individuals who have experienced the event (uncensored data) contribute a factor. For each event time, it compares the risk of the individual who experienced the event at that time with the total risk of the risk set (all individuals who have neither had the event nor been censored before that time). Censored individuals thus still enter the risk sets for as long as they are under observation, which is how the partial likelihood method handles right-censored data. The objective is to maximize this partial likelihood, which is equivalent to minimizing the negative log of this likelihood. For categorical survival prediction (i.e., survival or risk classification; for example, binary high-/low-risk classification), the cross-entropy loss [153] has been widely used. However, the regular cross-entropy loss may lead to a high prediction error and a heavy bias. Therefore, the negative log-likelihood function of a discrete time-to-event model has been proposed as a theoretically sound and easy-to-implement solution [153]. In many applications, other loss functions are commonly used as a joint loss together with the negative log-likelihood loss, for example, the ranking loss in the DeepHit method [154] and the MSE/MAE loss for normally distributed errors in the accelerated failure time model [155].
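The negative log partial likelihood can be sketched as follows, where `risk_scores` stands for the log-risk predicted by a network for each patient. The sketch makes simplifying assumptions: tied event times receive no special treatment, the loss is averaged over the number of events, and the four-patient cohort is hypothetical.

```python
import math

def cox_neg_log_partial_likelihood(risk_scores, times, events):
    """Negative log partial likelihood, averaged over observed events."""
    loss, n_events = 0.0, 0
    for i, (t_i, e_i) in enumerate(zip(times, events)):
        if not e_i:
            continue  # censored patients contribute only through risk sets
        # Risk set: everyone still under observation at time t_i.
        risk_set = [math.exp(risk_scores[j])
                    for j, t_j in enumerate(times) if t_j >= t_i]
        loss -= risk_scores[i] - math.log(sum(risk_set))
        n_events += 1
    return loss / max(n_events, 1)

times = [2.0, 4.0, 6.0, 8.0]   # follow-up times
events = [1, 0, 1, 1]          # 1 = event observed, 0 = right-censored
concordant = cox_neg_log_partial_likelihood([2.0, 1.0, 0.0, -1.0], times, events)
uninformative = cox_neg_log_partial_likelihood([0.0, 0.0, 0.0, 0.0], times, events)
discordant = cox_neg_log_partial_likelihood([-1.0, 0.0, 1.0, 2.0], times, events)
```

Risk scores that rank early events as high risk yield a lower loss than constant scores, which in turn yield a lower loss than a reversed ranking.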
For medical data, class imbalance is a pervasive challenge that can skew the learning process and result in suboptimal model performance. Loss functions play a critical role in addressing this imbalance. Traditional loss functions like cross-entropy loss can be adapted using weighted or cost-sensitive learning, where classes are assigned different weights based on their representation [150]. The focal loss [156] is designed explicitly for scenarios where the foreground–background class imbalance is high, as it modulates the contribution of each sample to the loss based on its difficulty, down-weighting samples that are already classified well. Sensitivity-specificity losses [12, 157] have been demonstrated to be effective for brain metastasis segmentation, where the class imbalance problem is addressed by weighting sensitivity higher. Other techniques like oversampling, undersampling, and synthetic sample generation can be combined with these loss functions to provide a holistic approach to managing class imbalance. For example, the WeightedRandomSampler in PyTorch draws training samples with probabilities proportional to user-specified weights, so that examples of the minority class can be oversampled.
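The two loss-based remedies described above can be sketched for a single binary prediction. The code below is a simplified plain-Python illustration: the function names are hypothetical, the values gamma=2 and alpha=0.25 are commonly used focal loss settings rather than recommendations of this review, and the probabilities are toy values.

```python
import math


def weighted_bce(p, y, pos_weight=1.0):
    """Binary cross-entropy in which positive samples are weighted by pos_weight."""
    p = min(max(p, 1e-12), 1.0 - 1e-12)  # avoid log(0)
    return -(pos_weight * y * math.log(p) + (1 - y) * math.log(1.0 - p))


def binary_focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Focal loss: cross-entropy scaled by (1 - p_t)^gamma and a class weight."""
    p_t = p if y == 1 else 1.0 - p            # probability assigned to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(max(p_t, 1e-12))


# An easy positive (p=0.9) versus a hard positive (p=0.1), both with label 1.
easy_ce, hard_ce = weighted_bce(0.9, 1), weighted_bce(0.1, 1)
easy_fl, hard_fl = binary_focal_loss(0.9, 1), binary_focal_loss(0.1, 1)
print(hard_ce / easy_ce)  # the hard sample dominates moderately under cross-entropy
print(hard_fl / easy_fl)  # and far more strongly under the focal loss
```

With gamma set to 0, the focal loss reduces to a class-weighted cross-entropy; increasing gamma shifts the training signal toward hard, often minority-class, samples.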
Given the multifaceted challenges in radiation oncology, it is not uncommon to see hybrid or combined loss functions, which amalgamate the strengths of individual loss metrics to address complex objectives.
Optimizers
Given the loss function, deep learning frameworks like PyTorch and TensorFlow automatically calculate the gradients of the loss function with respect to the trainable network weights through the backpropagation mechanism [66, 158], and gradient descent optimization algorithms can then be applied to update the network weights. Stochastic gradient descent (SGD) [158], valued for its simplicity and predictability, can be slow to converge and susceptible to local minima. Variants such as momentum or Nesterov accelerated gradient descent enhance SGD by adding a velocity term that accumulates past gradients, thus alleviating some of these challenges [11]. Adaptive moment estimation (Adam) [159] has gained significant traction due to its adaptive learning rates for each parameter, making it robust against a range of initial settings and often leading to faster convergence. Its variant, AdamW [160], decouples weight decay from the gradient-based update, which can improve generalization in radiation oncology tasks. RMSProp [11], which scales the learning rate of each parameter using a moving average of recent squared gradients, can also be effective in scenarios with noisy data. The choice among these optimizers should be influenced by the nature of the data, the architecture of the neural network, and the specific clinical objectives.
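To make the update rules tangible, the following sketch applies SGD with classical momentum and Adam to a one-dimensional toy problem. The quadratic objective, the starting point, and all hyperparameter values are assumptions for illustration. Deep learning frameworks implement the same ideas for millions of weights, with minor differences in how the momentum term is formulated.

```python
import math


def grad(w):
    """Gradient of the toy loss f(w) = (w - 3)^2, whose minimum is at w = 3."""
    return 2.0 * (w - 3.0)


def sgd_momentum(w, steps=500, lr=0.1, momentum=0.9):
    velocity = 0.0
    for _ in range(steps):
        velocity = momentum * velocity - lr * grad(w)  # accumulate past gradients
        w += velocity
    return w


def adam(w, steps=500, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    m = 0.0  # moving average of gradients (first moment)
    v = 0.0  # moving average of squared gradients (second moment)
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        m_hat = m / (1.0 - beta1 ** t)  # bias correction for the zero initialization
        v_hat = v / (1.0 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)  # per-parameter scaled step
    return w


print(sgd_momentum(0.0))  # approaches 3
print(adam(0.0))          # approaches 3
```

In Adam, the division by the square root of the second moment makes the effective step size largely independent of the gradient scale, which is one reason it is less sensitive to the initial learning rate than plain SGD.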
A comprehensive overview of gradient descent optimizers can be found in [11]. A visual explanation with 3D animations of different optimizers can be found at https://towardsdatascience.com/a-visual-explanation-of-gradient-descent-methods-momentum-adagrad-rmsprop-adam-f898b102325c [161]. For most applications in radiation oncology, the Adam optimizer is recommended.
Learning rate
The learning rate directly influences the convergence speed, stability, and eventual performance of a model [162]. It essentially dictates the step size in the optimization landscape: a high learning rate can expedite convergence but may overshoot minima or even diverge, while a very low learning rate yields stable updates but risks getting trapped in local minima or taking prohibitively long to train.
To address these challenges, learning rate schedulers, such as step decay, exponential decay, and cosine annealing, have been introduced to dynamically adjust the learning rate based on epochs or iterations. Cyclical learning rates, which vary the rate between two bounds, can also help in navigating the loss landscape effectively.
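The schedules named above are simple functions of the epoch or iteration index, as the following sketch shows. The function names and default values (decay factor, decay interval, cycle length) are hypothetical choices for illustration; frameworks such as PyTorch provide equivalent schedulers out of the box.

```python
import math


def step_decay(lr0, epoch, drop=0.5, every=10):
    """Multiply the learning rate by `drop` once every `every` epochs."""
    return lr0 * drop ** (epoch // every)


def exponential_decay(lr0, epoch, gamma=0.95):
    """Multiply the learning rate by `gamma` after each epoch."""
    return lr0 * gamma ** epoch


def cosine_annealing(lr0, epoch, total_epochs, lr_min=0.0):
    """Decrease smoothly from lr0 to lr_min along half a cosine period."""
    cosine = math.cos(math.pi * epoch / total_epochs)
    return lr_min + 0.5 * (lr0 - lr_min) * (1.0 + cosine)


def triangular_cyclical(base_lr, max_lr, iteration, step_size):
    """Cyclical rate that rises linearly to max_lr and falls back to base_lr."""
    cycle = math.floor(1 + iteration / (2 * step_size))
    x = abs(iteration / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)


for epoch in (0, 10, 25, 50, 100):
    print(epoch,
          step_decay(1e-3, epoch),
          exponential_decay(1e-3, epoch),
          cosine_annealing(1e-3, epoch, total_epochs=100))
```

Step and exponential decay reduce the rate monotonically, cosine annealing does so smoothly toward a floor, and the cyclical schedule repeatedly returns to larger rates, which can help the optimizer leave flat or poor regions of the loss landscape.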
The influence of the learning rate and its scheduling also varies with the choice of optimizer. For instance, SGD [158] often requires careful tuning of the initial learning rate and can greatly benefit from the aforementioned schedulers. Adam [159], which incorporates adaptive learning rates for each parameter, is more robust to the initial choice but can still benefit from dynamic rate adjustments. A fixed learning rate between \(10^{-5}\) and \(10^{-3}\) is commonly recommended for Adam. In essence, while modern optimizers and schedulers alleviate some of the sensitivity to the learning rate, choosing and adjusting it remains a blend of empirical experimentation and domain knowledge.
Batch size
Batch size significantly influences both the model’s convergence behavior and its computational efficiency [162]. A smaller batch size offers more frequent weight updates, potentially leading to faster convergence and enabling the model to escape from local minima or saddle points, albeit at the cost of increased noise in the gradient estimates. Conversely, a larger batch size provides more accurate gradient estimates due to averaging across more samples, resulting in stable convergence but with a heightened risk of settling in sharp minima, which may impact generalization. Additionally, larger batches exploit the parallel processing capabilities of modern GPUs more effectively, leading to faster epoch times. However, they also demand greater GPU memory resources.
In summary, while smaller batches often promote better generalization and model robustness, larger batches accelerate training and offer computational advantages.
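The relation between batch size and gradient noise can be illustrated with a small simulation. The sketch below assumes hypothetical per-sample gradients drawn as a true gradient of 1.0 plus unit Gaussian noise; the function name, the number of trials, and the seed are likewise illustrative assumptions.

```python
import random
import statistics


def minibatch_gradient_std(batch_size, n_trials=2000, seed=0):
    """Spread of the mini-batch gradient estimate across many random batches."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_trials):
        # Hypothetical per-sample gradients: true value 1.0 plus unit noise.
        batch = [1.0 + rng.gauss(0.0, 1.0) for _ in range(batch_size)]
        estimates.append(sum(batch) / batch_size)  # mini-batch average
    return statistics.stdev(estimates)


for batch_size in (4, 16, 64):
    print(batch_size, minibatch_gradient_std(batch_size))
```

Under these assumptions, the noise of the averaged gradient shrinks roughly with the square root of the batch size, so a 16-fold larger batch reduces the noise only about 4-fold while requiring 16 times as many samples per update.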
Monitoring of overfitting
Monitoring overfitting during the training of neural networks is of paramount importance to ensure that models generalize well to unseen data rather than merely memorizing the training dataset. Overfitting manifests when a model exhibits significantly better performance on training data compared to validation or test data. To detect this, it is common practice to split the dataset into separate training and validation subsets and closely observe the model’s performance metrics on both. Visualizing the learning curves, where training and validation losses are plotted against epochs, as displayed in Fig. 18, can provide clear indications of overfitting when the validation performance begins to degrade while the training performance continues to improve. With such monitoring, early stopping can be applied to mitigate overfitting. In other words, training will be stopped when the performance on the validation set starts to degrade (i.e., when the validation loss begins to increase), even if the training loss continues to decrease.
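Early stopping as described above is commonly implemented with a patience parameter, so that a single noisy epoch does not end training prematurely. The following sketch assumes that one validation loss is recorded per epoch; the function name, the patience of three epochs, and the loss values are hypothetical.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses.

    Training stops once the validation loss has failed to improve by more
    than `min_delta` for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0  # here one would also save the weights
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch


# Hypothetical validation curve: improves until epoch 3, then degrades.
val_losses = [1.00, 0.80, 0.60, 0.55, 0.57, 0.60, 0.66, 0.70]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=3)
print(stop_epoch, best_epoch)  # stops at epoch 6 and keeps the model from epoch 3
```

In practice, the weights from the best epoch are stored as a checkpoint and restored after stopping, so the final model is the one with the lowest validation loss rather than the last one trained.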
In addition, regularization techniques and dropout layers [163] can be employed to mitigate overfitting. Regularization methods add a penalty to the loss function to discourage overly complex models. Common techniques include L1 and L2 regularization. L1 regularization adds a penalty proportional to the sum of the absolute values of the weights, which encourages sparsity, i.e., many weights are driven to (or close to) zero. L2 regularization adds a penalty proportional to the sum of the squared weights, effectively penalizing large weights to prevent them from having too much influence. Dropout [163] is a technique used specifically in neural networks. It works by randomly setting a fraction of a layer's units to zero at each update during training, which helps to prevent neurons from co-adapting too much; at inference time, all units are kept. This randomness forces the network to learn more robust features that are useful in conjunction with many different random subsets of the other neurons.
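The penalties and the dropout mechanism can be written in a few lines. The sketch below is a plain-Python illustration with hypothetical function names, weights, and regularization strength. The dropout function uses the so-called inverted variant, in which the kept activations are rescaled during training so that no rescaling is needed at inference; this is an assumption about the variant shown, not a statement about the original formulation in [163].

```python
import random


def l1_penalty(weights, lam):
    """L1 term: lam times the sum of absolute weight values."""
    return lam * sum(abs(w) for w in weights)


def l2_penalty(weights, lam):
    """L2 term: lam times the sum of squared weight values."""
    return lam * sum(w * w for w in weights)


def dropout(activations, p=0.5, training=True, rng=None):
    """Inverted dropout: zero each unit with probability p during training."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if not training or p == 0.0:
        return list(activations)  # inference: all units are kept unchanged
    rng = rng or random.Random()
    keep = 1.0 - p
    # Kept units are scaled by 1/keep so the expected activation is unchanged.
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


weights = [1.0, -2.0, 3.0]
data_loss = 0.40  # hypothetical value of the task loss
total_loss = data_loss + l2_penalty(weights, lam=0.1)
print(total_loss)
print(dropout([1.0, 1.0, 1.0, 1.0], p=0.5, rng=random.Random(0)))
```

The penalty is simply added to the task loss before backpropagation, whereas dropout acts on the activations of a layer and is switched off when the model is evaluated.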
Conclusion
This paper provides a comprehensive overview of deep learning principles tailored toward radiation oncologists and medical physics experts. It elucidates the fundamental principles of major deep learning models, encompassing MLPs, CNNs, RNNs, transformers, GANs, diffusion-based generative models, and reinforcement learning. For each category, it presents representative networks alongside their specific applications in radiation oncology. Moreover, it outlines critical factors essential for training deep learning models, such as data preprocessing, loss functions, optimizers, and other pivotal training parameters including learning rate and batch size. This manuscript is designed to improve the comprehension of AI-based research and software tools in radiation oncology. By doing so, it seeks to connect the intricate technological theories of AI with the practical aspects of clinical practice in radiation oncology.
References
Meyer P, Noblet V, Mazzara C, Lallement A (2018) Survey on deep learning for radiotherapy. Comput Biol Med 98:126–146
Boldrini L, Bibault JE, Masciocchi C, Shen Y, Bittner MI (2019) Deep learning: a review for the radiation oncologist. Front Oncol 9:977
Li G, Wu X, Ma X (2022) Artificial intelligence in radiotherapy. Semin Cancer Biol 86:160–171
Goodfellow I, Bengio Y, Courville A (2016) Deep learning. MIT Press. http://www.deeplearningbook.org
Aggarwal CC (2018) Neural networks and deep learning: a textbook. Springer
Mayerhoefer ME, Materka A, Langs G, Häggström I, Szczypiński P, Gibbs P et al (2020) Introduction to radiomics. J Nucl Med 61(4):488–495
Tanadini-Lang S, Balermpas P, Guckenberger M, Pavic M, Riesterer O, Vuong D et al (2020) Radiomic biomarkers for head and neck squamous cell carcinoma. Strahlenther Onkol 196(10):868–878
Van Griethuysen JJ, Fedorov A, Parmar C, Hosny A, Aucoin N, Narayan V et al (2017) Computational radiomics system to decode the radiographic phenotype. Cancer Res 77(21):e104–e107
Hearst MA, Dumais ST, Osuna E, Platt J, Scholkopf B (1998) Support vector machines. IEEE Intell Syst Their Appl 13(4):18–28
Lou B, Doken S, Zhuang T, Wingerter D, Gidwani M, Mistry N et al (2019) An image-based deep learning framework for individualising radiotherapy dose: a retrospective analysis of outcome prediction. Lancet Digit Health 1(3):e136–e147
Ruder S (2016) An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747, pp 1–14
Huang Y, Bert C, Sommer P, Frey B, Gaipl U, Distel LV et al (2022) Deep learning for brain metastasis detection and segmentation in longitudinal MRI data. Med Phys 49(9):5773–5786
De Souto MC, Costa IG, de Araujo DS, Ludermir TB, Schliep A (2008) Clustering cancer gene expression data: a comparative study. BMC Bioinform 9(1):1–14
Kabir MF, Chen T, Ludwig SA (2023) A performance analysis of dimensionality reduction algorithms in machine learning models for cancer prediction. Healthc Anal 3:100125
Thies M, Wagner F, Huang Y, Gu M, Kling L, Pechmann S et al (2022) Calibration by differentiation–Self-supervised calibration for X‑ray microscopy using a differentiable cone-beam reconstruction operator. J Microsc 287(2):81–92
Huang Y, Taubmann O, Huang X, Haase V, Lauritsch G, Maier A (2018) Scale-space anisotropic total variation for limited angle tomography. IEEE Trans Radiat Plasma Med Sci 2(4):307–314
Rumelhart DE, Hinton GE, Williams RJ (1985) Learning internal representations by error propagation. Institute for Cognitive Science, University of California, San Diego, La Jolla
Bank D, Koenigstein N, Giryes R (2023) Autoencoders. In: Machine learning for data science handbook: data mining and knowledge discovery handbook, pp 353–374
Grandinetti J, Gao Y, Gonzalez Y, Deng J, Shen C, Jia X (2022) MR image reconstruction from undersampled data for image-guided radiation therapy using a patient-specific deep manifold image prior. Front Oncol 12:1013783
Huang P, Shang J, Xu Y, Hu Z, Zhang K, Dai J et al (2023) Anomaly detection in radiotherapy plans using deep autoencoder networks. Front Oncol 13:973
Ma B, Guo J, Chu H, van Dijk LV, van Ooijen PM, Langendijk JA et al (2023) Comparison of computed tomography image features extracted by radiomics, self-supervised learning and end-to-end deep learning for outcome prediction of oropharyngeal cancer. Phys Imaging Radiat Oncol 28:100502
Lundberg SM, Lee SI (2017) A unified approach to interpreting model predictions. Adv Neural Inf Process Syst 30
Zien A, Krämer N, Sonnenburg S, Rätsch G (2009) The feature importance ranking measure. In: Machine Learning and Knowledge Discovery in Databases: European Conference vol 20. Springer, Bled, Slovenia, pp 694–709
Simonyan K, Vedaldi A, Zisserman A (2014) Deep inside convolutional networks: visualising image classification models and saliency maps. In: Proceedings of the International Conference on Learning Representations (ICLR). ICLR, pp 1–8
Selvaraju RR, Cogswell M, Das A, Vedantam R, Parikh D, Batra D (2017) Grad-CAM: Visual explanations from deep networks via gradient-based localization. In: Proceedings of the IEEE international conference on computer vision, pp 618–626
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN et al (2017) Attention is all you need. Adv Neural Inf Process Syst 30
Saleem R, Yuan B, Kurugollu F, Anjum A, Liu L (2022) Explaining deep neural networks: A survey on the global interpretation methods. Neurocomputing 513:165–180
Joshi G, Walambe R, Kotecha K (2021) A review on explainability in multimodal deep neural nets. IEEE Access 9:59800–59821
Raissi M, Yazdani A, Karniadakis GE (2020) Hidden fluid mechanics: Learning velocity and pressure fields from flow visualizations. Science 367(6481):1026–1030
Cai S, Mao Z, Wang Z, Yin M, Karniadakis GE (2021) Physics-informed neural networks (PINNs) for fluid mechanics: A review. Acta Mech Sin 37(12):1727–1738
Maier AK, Syben C, Stimpel B, Würfl T, Hoffmann M, Schebesch F et al (2019) Learning with known operators reduces maximum error bounds. Nat Mach Intell 1(8):373–380
Lipková J, Angelikopoulos P, Wu S, Alberts E, Wiestler B, Diehl C et al (2019) Personalized radiotherapy design for glioblastoma: integrating mathematical tumor models, multimodal scans, and Bayesian inference. IEEE Trans Med Imaging 38(8):1875–1884
Lagergren JH, Nardini JT, Baker RE, Simpson MJ, Flores KB (2020) Biologically-informed neural networks guide mechanistic modeling from sparse experimental data. PLoS Comput Biol 16(12):e1008462
LeCun Y, Bottou L, Bengio Y, Haffner P (1998) Gradient-based learning applied to document recognition. Proc IEEE 86(11):2278–2324
Rosenblatt F (1957) The perceptron, a perceiving and recognizing automaton Project Para. Cornell Aeronautical Laboratory
Narayan S (1997) The generalized sigmoid activation function: Competitive supervised learning. Inf Sci 99(1-2):69–82
LeCun Y, Bengio Y, Hinton G (2015) Deep learning. Nature 521(7553):436–444
Nair V, Hinton GE (2010) Rectified linear units improve restricted boltzmann machines. In: Proceedings of the 27th international conference on machine learning. ICML, vol 10, pp 807–814
Xu J, Li Z, Du B, Zhang M, Liu J (2020) Reluplex made more practical: Leaky ReLU. In: 2020 IEEE Symposium on Computers and communications (ISCC). IEEE, pp 1–7
Ramachandran P, Zoph B, Le QV (2018) Searching for activation functions. In: Proceedings of International Conference on Learning Representations (ICLR), pp 1–12
Tao Q, Li L, Huang X, Xi X, Wang S, Suykens JA (2022) Piecewise linear neural networks and deep learning. Nat Rev Methods Primers 2(1):42
Hornik K, Stinchcombe M, White H (1989) Multilayer feedforward networks are universal approximators. Neural Networks 2(5):359–366
Lu Z, Pu H, Wang F, Hu Z, Wang L (2017) The expressive power of neural networks: A view from the width. Adv Neural Inf Process Syst 30
Kamnitsas K, Ledig C, Newcombe VF, Simpson JP, Kane AD, Menon DK et al (2017) Efficient multi-scale 3D CNN with fully connected CRF for accurate brain lesion segmentation. Med Image Anal 36:61–78
Zeiler MD, Fergus R (2014) Visualizing and understanding convolutional networks. In: Fleet D, Pajdla T, Schiele B, Tuytelaars T (eds) Computer Vision – ECCV 2014. Springer, Cham, pp 818–833
Chadha A (2020) Visualizing and understanding. Distilled notes for stanford CS231n: Convolutional neural networks for visual recognition. https://aman.ai
Cheng MM, Jiang PT, Han LH, Wang L, Torr P (2023) Deeply explain CNN via hierarchical decomposition. Int J Comput Vis 131(5):1091–1105
Gong E, Pauly JM, Wintermark M, Zaharchuk G (2018) Deep learning enables reduced gadolinium dose for contrast-enhanced brain MRI. J Magn Reson Imaging 48(2):330–340
Ayde R, Senft T, Salameh N, Sarracanie M (2022) Deep learning for fast low-field MRI acquisitions. Sci Rep 12(1):11394
Huang Y, Preuhs A, Manhart M, Lauritsch G, Maier A (2021) Data extrapolation from learned prior images for truncation correction in computed tomography. IEEE Trans Med Imaging 40(11):3042–3053
Chen X, Sun S, Bai N, Han K, Liu Q, Yao S et al (2021) A deep learning-based auto-segmentation system for organs-at-risk on whole-body computed tomography images for radiation therapy. Radiother Oncol 160:175–184
Weissmann T, Huang Y, Fischer S, Roesch J, Mansoorian S, Gaona AH et al (2023) Deep learning for automatic head and neck lymph node level delineation provides expert-level accuracy. Front Oncol 13:1115258
Kazemifar S, Balagopal A, Nguyen D, McGuire S, Hannan R, Jiang S et al (2018) Segmentation of the prostate and organs at risk in male pelvic CT images using deep learning. Biomed Phys Eng Express 4(5):55003
Valdes G, Simone CB II, Chen J, Lin A, Yom SS, Pattison AJ et al (2017) Clinical decision support of radiotherapy treatment planning: A data-driven machine learning strategy for patient-specific dosimetric decision making. Radiother Oncol 125(3):392–397
Ioffe S, Szegedy C (2015) Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: International conference on machine learning. PMLR, pp 448–456
Ronneberger O, Fischer P, Brox T (2015) U‑net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference. Proceedings, Part III, vol 18. Springer, Munich, Germany, pp 234–241 (October 5‑9)
Falk T, Mai D, Bensch R, Çiçek Ö, Abdulkadir A, Marrakchi Y et al (2019) U‑Net: deep learning for cell counting, detection, and morphometry. Nat Methods 16(1):67–70
Isensee F, Jaeger PF, Kohl SA, Petersen J, Maier-Hein KH (2021) nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nat Methods 18(2):203–211
Pflüger I, Wald T, Isensee F, Schell M, Meredig H, Schlamp K et al (2022) Automated detection and quantification of brain metastases on clinical MRI data using artificial neural networks. Neuro-oncology Adv 4(1):vdac138
Buchner JA, Kofler F, Etzel L, Mayinger M, Christ SM, Brunner TB et al (2023) Development and external validation of an MRI-based neural network for brain metastasis segmentation in the AURORA multicenter study. Radiother Oncol 178:109425
Krizhevsky A, Sutskever I, Hinton GE (2012) Imagenet classification with deep convolutional neural networks. Adv Neural Inf Process Syst 25
Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, pp 1–14
Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D et al (2015) Going deeper with convolutions. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1–9
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778
DiPietro R, Hager GD (2020) Deep learning: RNNs and LSTM. In: Handbook of medical image computing and computer assisted intervention. Elsevier, pp 503–519
Werbos PJ (1990) Backpropagation through time: what it does and how to do it. Proc IEEE 78(10):1550–1560
Pascanu R, Mikolov T, Bengio Y (2013) On the difficulty of training recurrent neural networks. In: International conference on machine learning. PMLR, pp 1310–1318
Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780
Gers FA, Schmidhuber J, Cummins F (2000) Learning to forget: Continual prediction with LSTM. Neural Comput 12(10):2451–2471
Cho K, van Merrienboer B, Gulcehre C, Bougares F, Schwenk H, Bengio Y (2014) Learning phrase representations using RNN encoder-decoder for statistical machine translation. In: Conference on Empirical Methods in Natural Language Processing (EMNLP 2014)
Cheon W, Kim H, Kim J (2020) Deep learning in radiation oncology. Prog Med Phys 31(3):111–123
Wang R, Liang X, Zhu X, Xie Y (2018) A feasibility of respiration prediction based on deep Bi-LSTM for real-time tumor tracking. IEEE Access 6:51262–51268
Zhang L, Lu L, Wang X, Zhu RM, Bagheri M, Summers RM et al (2019) Spatio-temporal convolutional LSTMs for tumor growth prediction by learning 4D longitudinal patient data. IEEE Trans Med Imaging 39(4):1114–1126
Lombardo E, Rabe M, Xiong Y, Nierer L, Cusumano D, Placidi L et al (2023) Evaluation of real-time tumor contour prediction using LSTM networks for MR-guided radiotherapy. Radiother Oncol 182:109555
Mikolov T, Chen K, Corrado G, Dean J (2013) Efficient estimation of word representations in vector space. In: International Conference on Learning Representations, pp 1–12
Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J (2013) Distributed representations of words and phrases and their compositionality. Adv Neural Inf Process Syst 26:1–9
Bahdanau D, Cho KH, Bengio Y (2015) Neural machine translation by jointly learning to align and translate. In: 3rd International Conference on Learning Representations (ICLR 2015), pp 1–15
Devlin J, Chang MW, Lee K, Toutanova K (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of NAACL-HLT, pp 4171–4186
Guckenberger M, Andratschke N, Ahmadsei M, Christ SM, Heusel AE, Kamal S et al (2023) Potential of ChatGPT in facilitating research in radiation oncology? Radiother Oncol 188:109894
Huang Y, Gomaa A, Semrau S, Haderlein M, Lettmaier S, Weissmann T et al (2023) Benchmarking ChatGPT-4 on a radiation oncology in-training exam and Red Journal Gray Zone cases: potentials and challenges for AI-assisted medical education and decision making in radiation oncology. Front Oncol 13
Moll M, Heilemann G, Georg D, Kauer-Dorner D, Kuess P (2024) The role of artificial intelligence in informed patient consent for radiotherapy treatments—a case report. Strahlenther Onkol: 1–5
He K, Gan C, Li Z, Rekik I, Yin Z, Ji W et al (2023) Transformers in medical image analysis. Intell Med 3(1):59–78
Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T et al (2021) An image is worth 16x16 words: transformers for image recognition at scale. In: International Conference on Learning Representations. Available from: https://openreview.net/forum?id=YicbFdNTTy
He S, Grant PE, Ou Y (2021) Global-local transformer for brain age estimation. IEEE Trans Med Imaging 41(1):213–224
Golovanevsky M, Eickhoff C, Singh R (2022) Multimodal attention-based deep learning for Alzheimer’s disease diagnosis. J Am Med Inform Assoc 29(12):2014–2022
Chen J, Lu Y, Yu Q, Luo X, Adeli E, Wang Y et al (2021) TransUNet: Transformers make strong encoders for medical image segmentation. arXiv preprint arXiv:2102.04306, pp 1–13
Zheng H, Lin Z, Zhou Q, Peng X, Xiao J, Zu C et al (2022) Multi-transSP: Multimodal transformer for survival prediction of nasopharyngeal carcinoma patients. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, pp 234–243
Dai Z, Yang Z, Yang Y, Carbonell JG, Le Q, Salakhutdinov R (2019) Transformer-XL: Attentive Language Models beyond a Fixed-Length Context. In: Proceedings of the 57th annual meeting of the association for computational linguistics, pp 2978–2988
Katharopoulos A, Vyas A, Pappas N, Fleuret F (2020) Transformers are RNNs: Fast autoregressive transformers with linear attention. In: International conference on machine learning. PMLR, pp 5156–5165
Cao H, Wang Y, Chen J, Jiang D, Zhang X, Tian Q et al (2022) Swin-UNet: UNet-like pure transformer for medical image segmentation. In: European conference on computer vision. Springer, pp 205–218
Hatamizadeh A, Nath V, Tang Y, Yang D, Roth HR, Xu D (2021) Swin UNETR: Swin transformers for semantic segmentation of brain tumors in MRI images. In: International MICCAI Brainlesion Workshop. Springer, pp 272–284
Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S et al (2014) Generative adversarial nets. Adv Neural Inf Process Syst: 2672–2680
Isola P, Zhu JY, Zhou T, Efros AA (2017) Image-to-image translation with conditional adversarial networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1125–1134
Zhu JY, Park T, Isola P, Efros AA (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In: Proceedings of the IEEE international conference on computer vision, pp 2223–2232
Yi X, Walia E, Babyn P (2019) Generative adversarial network in medical imaging: A review. Med Image Anal: 101552
Liang X, Chen L, Nguyen D, Zhou Z, Gu X, Yang M et al (2019) Generating synthesized computed tomography (CT) from cone-beam computed tomography (CBCT) using CycleGAN for adaptive radiation therapy. Phys Med Biol 64(12):125002
Kurz C, Maspero M, Savenije MH, Landry G, Kamp F, Pinto M et al (2019) CBCT correction using a cycle-consistent generative adversarial network and unpaired training to enable photon and proton dose calculation. Phys Med Biol 64(22):225004
Wang H, Liu X, Kong L, Huang Y, Chen H, Ma X et al (2023) Improving CBCT image quality to the CT level using RegGAN in esophageal cancer adaptive radiotherapy. Strahlenther Onkol: 1–13
Kazemifar S, Barragán Montero AM, Souris K, Rivas ST, Timmerman R, Park YK et al (2020) Dosimetric evaluation of synthetic CT generated with GANs for MRI-only proton therapy treatment planning of brain tumors. J Applied Clin Med Phys 21(5):76–86
Liu X, Emami H, Nejad-Davarani SP, Morris E, Schultz L, Dong M et al (2021) Performance of deep learning synthetic CTs for MR-only brain radiation therapy. J Applied Clin Med Phys 22(1):308–317
Elmahdy MS, Wolterink JM, Sokooti H, Išgum I, Staring M (2019) Adversarial optimization for joint registration and segmentation in prostate CT radiotherapy. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2019: 22nd International Conference. Proceedings, Part VI, vol 22. Springer, Shenzhen, China, pp 366–374 (October 13–17)
Song Y, Sohl-Dickstein J, Kingma DP, Kumar A, Ermon S, Poole B (2021) Score-based generative modeling through stochastic differential equations. In: International Conference on Learning Representations, pp 1–36
Song Y, Ermon S (2019) Generative modeling by estimating gradients of the data distribution. Adv Neural Inf Process Syst 32
Ho J, Jain A, Abbeel P (2020) Denoising diffusion probabilistic models. Adv Neural Inf Process Syst 33:6840–6851
Song J, Meng C, Ermon S (2021) Denoising diffusion implicit models. In: International Conference on Learning Representations, pp 1–20
Yang L, Zhang Z, Song Y, Hong S, Xu R, Zhao Y et al (2023) Diffusion models: A comprehensive survey of methods and applications. ACM Comput Surv 56(4):1–39
Anderson BD (1982) Reverse-time diffusion equation models. Stoch Process Their Appl 12(3):313–326
Vincent P (2011) A connection between score matching and denoising autoencoders. Neural Comput 23(7):1661–1674
Jolicoeur-Martineau A, Li K, Piché-Taillefer R, Kachman T, Mitliagkas I (2021) Gotta go fast when generating data with score-based models. arXiv preprint arXiv:2105.14080, pp 1–24
Lu C, Zhou Y, Bao F, Chen J, Li C, Zhu J (2022) DPM-Solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps. Adv Neural Inf Process Syst 35:5775–5787
Zhang Q, Chen Y (2022) Fast sampling of diffusion models with exponential integrator. In: The eleventh international conference on learning representations, pp 1–33
Kazerouni A, Aghdam EK, Heidari M, Azad R, Fayyaz M, Hacihaliloglu I et al (2023) Diffusion models in medical imaging: A comprehensive survey. Med Image Anal: 102846
Özbey M, Dalmaz O, Dar SU, Bedel HA, Özturk Ş, Güngör A et al (2023) Unsupervised medical image translation with adversarial diffusion models. IEEE Trans Med Imaging
Jalal A, Arvinte M, Daras G, Price E, Dimakis AG, Tamir J (2021) Robust compressed sensing MRI with deep generative priors. Adv Neural Inf Process Syst 34:14938–14954
Chung H, Ye JC (2022) Score-based diffusion models for accelerated MRI. Med Image Anal 80:102479
Fernandez V, Pinaya WHL, Borges P, Tudosiu PD, Graham MS, Vercauteren T et al (2022) Can segmentation models be trained with fully synthetically generated data? In: International Workshop on Simulation and Synthesis in Medical Imaging. Springer, pp 79–90
Xiang T, Yurt M, Syed AB, Setsompop K, Chaudhari A (2023) DDM²: Self-supervised diffusion MRI denoising with generative diffusion models. arXiv preprint arXiv:2302.03018
Sohl-Dickstein J, Weiss E, Maheswaranathan N, Ganguli S (2015) Deep unsupervised learning using nonequilibrium thermodynamics. In: International conference on machine learning. PMLR, pp 2256–2265
Rombach R, Blattmann A, Lorenz D, Esser P, Ommer B (2022) High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10684–10695
Blattmann A, Rombach R, Ling H, Dockhorn T, Kim SW, Fidler S et al (2023) Align your latents: High-resolution video synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 22563–22575
Sutton RS, Barto AG (2018) Reinforcement learning: An introduction. MIT Press
Shakya AK, Pillai G, Chakrabarty S (2023) Reinforcement learning algorithms: A brief survey. Expert Syst Appl: 120495
Zhou SK, Le HN, Luu K, Nguyen HV, Ayache N (2021) Deep reinforcement learning in medical imaging: A literature review. Med Image Anal 73:102193
Watkins CJ, Dayan P (1992) Q‑learning. Mach Learn 8:279–292
Mnih V, Kavukcuoglu K, Silver D, Rusu AA, Veness J, Bellemare MG et al (2015) Human-level control through deep reinforcement learning. Nature 518(7540):529–533
Sutton RS, McAllester D, Singh S, Mansour Y (1999) Policy gradient methods for reinforcement learning with function approximation. Adv Neural Inf Process Syst 12
Silver D, Huang A, Maddison CJ, Guez A, Sifre L, Van Den Driessche G et al (2016) Mastering the game of Go with deep neural networks and tree search. Nature 529(7587):484–489
Shen Q, Li Y, Jiang H, Wang Z, Zhao T (2020) Deep reinforcement learning with robust and smooth policy. In: International Conference on Machine Learning. PMLR, pp 8707–8718
Bellman R (1966) Dynamic programming. Science 153(3731):34–37
Mnih V, Kavukcuoglu K, Silver D, Graves A, Antonoglou I, Wierstra D et al (2013) Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602
Ghesu FC, Georgescu B, Zheng Y, Grbic S, Maier A, Hornegger J et al (2017) Multi-scale deep reinforcement learning for real-time 3D-landmark detection in CT scans. IEEE Trans Pattern Anal Mach Intell 41(1):176–189
Al WA, Yun ID (2019) Partial policy-based reinforcement learning for anatomical landmark localization in 3d medical images. IEEE Trans Med Imaging 39(4):1245–1255
Aa ZYA, Voigt I, Mihalef V, Houle H, John M et al (2020) A bottom-up approach for real-time mitral valve annulus modeling on 3D echo images. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2020: 23rd International Conference. VI, vol 23. Springer, Lima, Peru, pp 458–467
Liao R, Miao S, de Tournemire P, Grbic S, Kamen A, Mansi T et al (2017) An artificial agent for robust image registration. In: Proceedings of the AAAI conference on artificial intelligence, vol 31
Krebs J, Mansi T, Delingette H, Zhang L, Ghesu FC, Miao S et al (2017) Robust non-rigid registration through agent-based action learning. In: Medical Image Computing and Computer Assisted Intervention- MICCAI 2017: 20th International Conference, vol 20. Springer, Canada, pp 344–352 (September 11-13, 2017)
Dou H, Yang X, Qian J, Xue W, Qin H, Wang X et al (2019) Agent with warm start and active termination for plane localization in 3D ultrasound. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, pp 290–298
Maicas G, Carneiro G, Bradley AP, Nascimento JC, Reid I (2017) Deep reinforcement learning for active breast lesion detection from DCE-MRI. In: International conference on medical image computing and computer-assisted intervention. Springer, pp 665–673
Xu B, Liu J, Hou X, Liu B, Garibaldi J, Ellis IO et al (2019) Attention by selection: A deep selective attention approach to breast cancer classification. IEEE Trans Med Imaging 39(6):1930–1941
Lee S, Lee YH (2020) Improving emergency department efficiency by patient scheduling using deep reinforcement learning. Healthcare 8:77
Tseng HH, Luo Y, Cui S, Chien JT, Ten Haken RK, El Naqa I (2017) Deep reinforcement learning for automated radiation adaptation in lung cancer. Med Phys 44(12):6690–6705
Ebrahimi S, Lim GJ (2021) A reinforcement learning approach for finding optimal policy of adaptive radiation therapy considering uncertain tumor biological response. Artif Intell Med 121:102193
Niraula D, Jamaluddin J, Matuszak MM, Ten Haken RK, El Naqa I (2023) Author Correction: Quantum deep reinforcement learning for clinical decision support in oncology: application to adaptive radiotherapy. Sci Rep 13(1):2318
Peters J, Schaal S (2008) Natural actor-critic. Neurocomputing 71(7-9):1180–1190
Heess N, Wayne G, Silver D, Lillicrap T, Erez T, Tassa Y (2015) Learning continuous control policies by stochastic value gradients. Adv Neural Inf Process Syst 28
Mnih V, Badia AP, Mirza M, Graves A, Lillicrap T, Harley T et al (2016) Asynchronous methods for deep reinforcement learning. In: International conference on machine learning. PMLR, pp 1928–1937
Tustison NJ, Avants BB, Cook PA, Zheng Y, Egan A, Yushkevich PA et al (2010) N4ITK: improved N3 bias correction. IEEE Trans Med Imaging 29(6):1310–1320
Brett M, Johnsrude IS, Owen AM (2002) The problem of functional localization in the human brain. Nat Rev Neurosci 3(3):243–249
Isensee F, Schell M, Pflueger I, Brugnara G, Bonekamp D, Neuberger U et al (2019) Automated brain extraction of multisequence MRI using artificial neural networks. Hum Brain Mapp 40(17):4952–4964
Sharif S, Naqvi RA, Biswas M, Loh WK (2022) Deep perceptual enhancement for medical image analysis. IEEE J Biomed Health Inform 26(10):4826–4836
Ma J, Chen J, Ng M, Huang R, Li Y, Li C et al (2021) Loss odyssey in medical image segmentation. Med Image Anal 71:102035
Cox DR (1972) Regression models and life-tables. J R Stat Soc Ser B (Methodological) 34(2):187–202
Efron B (1977) The efficiency of Cox’s likelihood function for censored data. J Am Stat Assoc 72(359):557–565
Zadeh SG, Schmid M (2020) Bias in cross-entropy-based training of deep survival networks. IEEE Trans Pattern Anal Mach Intell 43(9):3126–3137
Lee C, Zame W, Yoon J, Van Der Schaar M (2018) DeepHit: A deep learning approach to survival analysis with competing risks. Proc AAAI Conf Artif Intell 32
Crowther MJ, Royston P, Clements M (2023) A flexible parametric accelerated failure time model and the extension to time-dependent acceleration factors. Biostatistics 24(3):811–831
Lin TY, Goyal P, Girshick R, He K, Dollár P (2017) Focal loss for dense object detection. In: Proceedings of the IEEE international conference on computer vision, pp 2980–2988
Brosch T, Yoo Y, Tang L, Li D, Traboulsee A, Tam R (2015) Deep convolutional encoder networks for multiple sclerosis lesion segmentation. In: Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015. Lecture Notes in Computer Science. Springer, Cham
Amari SI (1993) Backpropagation and stochastic gradient descent method. Neurocomputing 5(4–5):185–196
Kingma DP, Ba J (2015) Adam: a method for stochastic optimization. In: International Conference on Learning Representations, pp 1–13
Loshchilov I, Hutter F (2018) Decoupled weight decay regularization. In: International Conference on Learning Representations, pp 1–18
Jiang L (2020) A visual explanation of gradient descent methods (Momentum, AdaGrad, RMSProp, Adam)
He F, Liu T, Tao D (2019) Control batch size and learning rate to generalize well: Theoretical and empirical evidence. Adv Neural Inf Process Syst 32
Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R (2014) Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res 15(1):1929–1958
Funding
Open Access funding enabled and organized by Projekt DEAL.
Ethics declarations
Conflict of interest
The authors declare no conflicts of interest.
Ethical standards
This work does not involve any human or animal material. Approval by an ethics committee was not required.
Appendix
List of symbols and terms
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Huang, Y., Gomaa, A., Höfler, D. et al. Principles of artificial intelligence in radiooncology. Strahlenther Onkol (2024). https://doi.org/10.1007/s00066-024-02272-0
DOI: https://doi.org/10.1007/s00066-024-02272-0