1 Introduction

Image segmentation is an essential and challenging task in medical image analysis. Its goal is to delineate the object boundaries by assigning each pixel/voxel a label, where pixels/voxels with the same labels share similar properties or belong to the same class. In the context of neuroimaging, robust and accurate image segmentation can effectively help neurosurgeons and doctors, e.g., measure the size of brain lesions or quantitatively evaluate the volume changes of brain tissue throughout treatment or surgery. For instance, quantitative measurements of subcortical and cortical structures are critical for studies of several neurodegenerative diseases such as Alzheimer’s, Parkinson’s, and Huntington’s diseases. Automatic segmentation of multiple sclerosis (MS) lesions is essential for the quantitative analysis of disease progression. The delineation of acute ischemic stroke lesions is crucial for increasing the likelihood of good clinical outcomes for the patient. While manual delineation of object boundaries is a tedious and time-consuming task, automatic segmentation algorithms can significantly reduce the workload of clinicians and increase the objectivity and reproducibility of measurements. To be specific, the segmentation task in medical images usually refers to semantic segmentation. For example, for paired brain structures (e.g., left and right pairs of subcortical structures), the instances of the same category will not be specified in the segmentation, in contrast to instance and panoptic segmentation.

There are many neuroimaging modalities such as magnetic resonance imaging, computed tomography, transcranial Doppler, and positron emission tomography. Moreover, neuroimaging studies often contain multimodal and/or longitudinal data, which can help improve our understanding of the anatomical and functional properties of the brain by utilizing complementary physical and physiological sensitivities. In this chapter, we first present some background information to help readers get familiar with the fundamental elements used in deep learning-based segmentation frameworks. Next, we discuss the learning-based segmentation approaches in the context of different supervision settings, along with some real-world applications.

2 Methods

2.1 Fundamentals

2.1.1 Common Network Architectures for Segmentation Tasks

Convolutional neural networks (CNNs) dominated the medical image segmentation field in recent years. CNNs leverage information from images to predict segmentations by hierarchically learning parameters with linear and nonlinear layers. We begin by discussing some popular models and their architectures: (1) U-Net [1], (2) V-Net [2], (3) attention U-Net [3, 4], and (4) nnU-Net [5, 6].

U-Net is the most popular model for medical image segmentation, and its architecture is shown in Fig. 1. The network has two main parts: the encoder and the decoder, with skip connections in between. The encoder consists of repeated blocks of two 3 × 3 convolutions (conv) without zero-padding, each followed by a rectified linear unit (ReLU) activation function. A max-pooling operation with stride 2 is used to connect successive levels, i.e., for downsampling. We note that the number of feature map channels is doubled at each subsequent level. In the symmetric decoder counterpart, a 2 × 2 up-convolution (up-conv) is used not only for upsampling but also for reducing the number of channels by half. The center-cropped feature map from the encoder is delivered to the decoder via a skip connection at each level to preserve the low-level information; the cropping is needed so that the feature maps have the same size for concatenation. Next, two repeated 3 × 3 conv and ReLU are applied. Lastly, a 1 × 1 conv converts the channel number to the desired number of classes C. In this configuration, the network takes a 2D image as input and produces a segmentation map with C classes. Later, a 3D U-Net [7] was introduced for volumetric segmentation that learns from volumetric images.
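
To make the building blocks concrete, the following PyTorch sketch assembles a one-level U-Net from the operations just described (two 3 × 3 conv + ReLU blocks, 2 × 2 max-pooling, 2 × 2 up-conv, skip concatenation, and a final 1 × 1 conv). It is a minimal illustration rather than the reference implementation of [1]; in particular, it uses zero-padding so that no cropping of the skip connection is needed, and the channel numbers are arbitrary.

```python
import torch
import torch.nn as nn

class DoubleConv(nn.Module):
    """Two repeated 3x3 convolutions, each followed by ReLU (padding keeps sizes aligned)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)

class TinyUNet(nn.Module):
    """A one-level U-Net: encoder, bottleneck, and decoder with one skip connection."""
    def __init__(self, in_ch=1, n_classes=2, base=64):
        super().__init__()
        self.enc = DoubleConv(in_ch, base)
        self.pool = nn.MaxPool2d(2)                                 # downsampling with stride 2
        self.bottleneck = DoubleConv(base, base * 2)                # channel number doubles at the next level
        self.up = nn.ConvTranspose2d(base * 2, base, 2, stride=2)   # 2x2 up-conv halves the channels
        self.dec = DoubleConv(base * 2, base)                       # after concatenation with the skip
        self.head = nn.Conv2d(base, n_classes, kernel_size=1)       # 1x1 conv -> C output classes

    def forward(self, x):
        e = self.enc(x)
        b = self.bottleneck(self.pool(e))
        d = self.up(b)
        d = self.dec(torch.cat([e, d], dim=1))                      # skip connection (no cropping due to padding)
        return self.head(d)

logits = TinyUNet()(torch.randn(1, 1, 64, 64))                      # -> shape [1, 2, 64, 64]
```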

Fig. 1

U-Net architecture. Blue boxes are the feature maps. Channel numbers are denoted above each box, while the tensor sizes are denoted on the lower left. White boxes show the concatenations and arrows indicate various operations. Ⓒ2015 Springer Nature. Reprinted, with permission, from [1]

V-Net is another popular model for volumetric medical image segmentation. It is a fully convolutional neural network trained end-to-end. Based upon the overall structure of the U-Net, the V-Net [2] replaces the regular conv layers with residual blocks [8] and enlarges the convolution kernel size to 5 × 5 × 5. A residual block can be formulated as follows: (1) the input of the block is processed by conv layers and nonlinearities, and (2) the input is then added to the output of the last conv layer or nonlinearity of the block.
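
The residual formulation can be summarized in a few lines of PyTorch. The sketch below is a generic residual block following steps (1) and (2) above, not the exact V-Net block of [2]; the 1 × 1 × 1 shortcut convolution used when channel numbers differ is an assumption of this sketch.

```python
import torch
import torch.nn as nn

class ResidualBlock3D(nn.Module):
    """Residual block: output = activation(conv_path(x) + shortcut(x))."""
    def __init__(self, in_ch, out_ch, kernel_size=5):
        super().__init__()
        pad = kernel_size // 2
        self.conv_path = nn.Sequential(
            nn.Conv3d(in_ch, out_ch, kernel_size, padding=pad), nn.PReLU(),
            nn.Conv3d(out_ch, out_ch, kernel_size, padding=pad),
        )
        # 1x1x1 convolution on the shortcut when the channel numbers differ (assumption of this sketch)
        self.shortcut = nn.Identity() if in_ch == out_ch else nn.Conv3d(in_ch, out_ch, kernel_size=1)
        self.act = nn.PReLU()

    def forward(self, x):
        return self.act(self.conv_path(x) + self.shortcut(x))       # input added to the conv-path output

y = ResidualBlock3D(16, 32)(torch.randn(1, 16, 32, 32, 32))          # -> shape [1, 32, 32, 32, 32]
```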

Attention U-Net is a model based on U-Net with attention gates (AG) in the skip connections (Fig. 2). The attention gates can learn to focus on the segmentation target: during training, salient features are emphasized with larger weights. This leads the model to achieve higher accuracy on target structures with various shapes and sizes. In addition, AGs are easy to integrate into existing popular CNN architectures. The details of the attention mechanism and attention gates are discussed in Subheading 2.1.2. More details on attention can also be found in Chap. 6.

Fig. 2

Attention U-Net architecture. Hi, Wi, and Di represent the height, width, and depth of the feature map at the ith layer of the U-Net structure. Fi indicates the number of feature map channels. Replicated from [4] (CC BY 4.0)

nnU-Net is a medical image segmentation pipeline that automatically configures its network architecture for the dataset and task at hand, without any manual intervention. Depending on the dataset and task, nnU-Net generates one of (1) a 2D U-Net, (2) a 3D U-Net, or (3) a cascaded 3D U-Net as the segmentation network. In the cascaded 3D U-Net, the first network takes downsampled images as inputs, and the second network uses the images at full resolution as inputs to refine the segmentation. nnU-Net is often used as a baseline method in medical image segmentation challenges because of its robust performance across various target structures and image properties. The details of nnU-Net can be found in [6].

2.1.2 Attention Modules

Although the U-Net architecture described in Subheading 2.1.1 has achieved remarkable success in medical image segmentation, the downsampling steps included in the encoder path can induce poor segmentation accuracy for small-scale anatomical structures (e.g., tumors and lesions). To tackle this issue, attention modules are often applied so that salient features are enhanced with higher weights while less important features are suppressed. This subsection introduces two types of attention mechanisms: additive attention and multiplicative attention.

Additive Attention

As discussed in the previous section, U-Net is the most popular backbone for medical image analysis tasks. The downsampling enables it to work on features of different scales. Suppose we are working on a 3D segmentation problem. The output of the U-Net encoder at the lth level is then a tensor Xl of size [Fl, Hl, Wl, Dl], where Hl, Wl, Dl denote the height, width, and depth of the feature map, respectively, and Fl represents the length of the feature vectors. We regard the tensor as a set of feature vectors \( {\boldsymbol{x}}_i^l \):

$$ {\mathcal{X}}^l={\left\{{\boldsymbol{x}}_i^l\right\}}_{i=1}^n,\kern1em {\boldsymbol{x}}_i^l\in {\mathbb{R}}^{F_l} $$
(1)

where n = Hl × Wl × Dl. The attention gate assigns a weight αi to each vector xi so that the model can concentrate on salient features. Ideally, important features are assigned higher weights so that they do not vanish during downsampling. The output of the attention gate is a collection of weighted feature vectors:

$$ {\hat{\mathcal{X}}}^l={\left\{{\alpha}_i^l\cdot {\boldsymbol{x}}_i^l\right\}}_{i=1}^n,\kern1em {\alpha}_i^l\in \mathbb{R} $$
(2)

These weights αi, also known as gating coefficients, are determined by an attention mechanism that delineates the correlation between the feature vector x and a gating signal g. As shown in Fig. 3, for all \( {\boldsymbol{x}}_i^l\in {\mathcal{X}}^l \), we compute an additive attention with regard to a corresponding gi by

$$ {s}_{att}^l={\boldsymbol{\psi}}^{\top}\left[{\sigma}_1\left({\boldsymbol{W}}_x^{\top }{\boldsymbol{x}}_i^l+{\boldsymbol{W}}_g^{\top }{\boldsymbol{g}}_i+{\boldsymbol{b}}_g\right)\right]+{b}_{\psi } $$
(3)

where bg and bψ are bias terms and Wx, Wg, and ψ are linear transformations. Wx and Wg map their inputs to \( {\mathbb{R}}^{F_{int}} \), where Fint is a self-defined integer, and ψ maps the result to a scalar. We denote these learnable parameters by the set Θatt. The coefficients \( {s}_{att}^l \) are normalized to [0, 1] by a sigmoid function σ2:

$$ {\alpha}_i^l={\sigma}_2\left({s}_{att}^l\left({\boldsymbol{x}}_i^l,{\boldsymbol{g}}_i;{\Theta}_{att}\right)\right) $$
(4)

The attention coefficient is thus computed from a linear combination of the feature vector and the gating signal. In practical applications [3, 4, 9], the gating signal is chosen to be the coarser feature map, as indicated in Fig. 2. In other words, for input feature \( {\boldsymbol{x}}_i^l \), the corresponding gating signal is defined by

$$ {\boldsymbol{g}}_i={\boldsymbol{x}}_i^{l+1} $$
(5)

Note that an extra downsampling step should be applied to Xl so that it has the same shape as Xl+1. In experiments on brain tumor segmentation from MRI datasets [9] and pancreas segmentation from abdominal CT datasets [4], AGs were shown to improve the segmentation performance for diverse types of model backbones, including U-Net and Residual U-Net.
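
The gate of Eqs. 3-5 can be written compactly with 1 × 1 × 1 convolutions acting as the per-voxel linear transformations. The sketch below is a simplified version for 3D feature maps and assumes the gating signal has already been resampled to the same spatial size as x; it is an illustration, not the exact implementation of [3, 4].

```python
import torch
import torch.nn as nn

class AdditiveAttentionGate(nn.Module):
    """Simplified additive attention gate: alpha = sigmoid(psi^T ReLU(Wx x + Wg g + b))."""
    def __init__(self, f_l, f_g, f_int):
        super().__init__()
        self.W_x = nn.Conv3d(f_l, f_int, kernel_size=1, bias=False)  # linear transform of the feature vectors
        self.W_g = nn.Conv3d(f_g, f_int, kernel_size=1)               # linear transform of the gating signal (with bias)
        self.psi = nn.Conv3d(f_int, 1, kernel_size=1)                 # maps to one scalar score per voxel
        self.relu = nn.ReLU(inplace=True)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x, g):
        # x: [B, F_l, H, W, D]; g: gating signal already resampled to the same spatial size as x
        s = self.psi(self.relu(self.W_x(x) + self.W_g(g)))            # Eq. 3: one attention score per voxel
        alpha = self.sigmoid(s)                                        # Eq. 4: gating coefficients in [0, 1]
        return alpha * x                                               # Eq. 2: weighted feature vectors

x = torch.randn(1, 32, 16, 16, 16)                                     # features at level l
g = torch.randn(1, 64, 16, 16, 16)                                     # coarser features (Eq. 5), upsampled to match x
out = AdditiveAttentionGate(f_l=32, f_g=64, f_int=16)(x, g)            # same shape as x
```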

Fig. 3

The structure of the additive attention gate. \( {\boldsymbol{x}}_i^l \) is the ith feature vector at the lth level of the U-Net structure and gi is the corresponding gating signal. Wx and Wg are the linear transformation matrices applied to \( {\boldsymbol{x}}_i^l \) and gi, respectively. The sum of the resultant vectors will be activated by ReLU and then its dot product with a vector ψ is computed. The sigmoid function is used to normalize the resulting scalar to [0, 1] range, which is the gating coefficient αi. The weighted feature vector is denoted by \( \hat{{\boldsymbol{x}}_i^l} \). Adapted from [4] (CC BY 4.0)

Multiplicative Attention

Similar to additive attention, the multiplicative mechanism can also be leveraged to compute the importance of feature vectors. The basic idea of multiplicative attention was first introduced in machine translation [11]. Evolving from that, Vaswani et al. proposed a groundbreaking transformer architecture [10] which has been widely implemented in image processing [12, 13]. In recent research, transformers have been incorporated with the U-Net structure [14, 15] to improve medical image segmentation performance.

The attention function is described by matching a query vector q with a set of key vectors {k1, k2, ..., kn} to obtain the weights of the corresponding values {v1, v2, ..., vn}. Figure 4a shows an example for n = 4. Suppose the vectors q, ki, and vi have the same dimension \( {\mathbb{R}}^d \). Then, the attention function is

$$ {s}_i=\frac{{\boldsymbol{q}}^{\top }{\boldsymbol{k}}_i}{\sqrt{d}} $$
(6)

We note that the dot product can have a large magnitude when d is large, which can cause a vanishing gradient problem in the softmax function; the score is therefore scaled by \( \sqrt{d} \) to alleviate this. Equation 6 is a commonly used attention function in transformers. There are other options, including si = q⊤ki and si = q⊤Wki, where W is a learnable parameter. Generally, the attention value si is determined by the similarity between the query and the key. Similar to the additive attention gate, these attention values are normalized to [0, 1] by a softmax function σ3:

$$ {\alpha}_i={\sigma}_3\left({s}_1,...,{s}_n\right)=\frac{e^{s_i}}{\sum_{j=1}^n{e}^{s_j}} $$
(7)
Fig. 4

(a) The dot-product attention gate. ki are the keys and q is the query vector. si are the outputs of the attention function. By using the softmax σ3, the attention coefficients αi are normalized to [0, 1] range. The output will be the weighted sum of values vi. (b) The multi-head attention is implemented in transformers. The input values, keys, and query are linearly projected to different spaces. Then the dot-product attention is applied on each space. The resultant vectors are concatenated by channel and passed through another linear transformation. Image (b) is adapted from [10]. Permission to reuse was kindly granted by the authors

The output of the attention gate will be \( \hat{\boldsymbol{v}}={\sum}_{i=1}^n{\alpha}_i{\boldsymbol{v}}_i \). In the transformer application, the values, keys, and queries are usually linearly projected into several different spaces, and then the attention gate is applied in each space as illustrated in Fig. 4b. This approach is called multi-head attention; it enables the model to jointly attend to information from different subspaces.

In practice, the values vi, keys ki, and queries are often derived from the same feature vectors, in which case the module is called multi-head self-attention (MSA). Chen et al. proposed TransUNet [15], which leverages this module in the bottleneck of a U-Net as shown in Fig. 5. They argue that such a combination of a U-Net and the transformer achieves superior performance in multi-organ segmentation tasks.
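
The scaled dot-product attention of Eqs. 6 and 7 can be sketched in a few lines; the multi-head variant of Fig. 4b additionally applies learned linear projections before and after this function (available in PyTorch as nn.MultiheadAttention). The example below is a minimal single-head illustration with arbitrary tensor sizes.

```python
import torch

def scaled_dot_product_attention(q, k, v):
    """Eqs. 6-7: scaled dot-product scores, softmax-normalized weights, weighted sum of values.
    q: [n_q, d], k: [n, d], v: [n, d_v]."""
    d = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d ** 0.5   # Eq. 6: dot products scaled by sqrt(d)
    alpha = torch.softmax(scores, dim=-1)         # Eq. 7: attention coefficients in [0, 1]
    return alpha @ v                              # weighted sum of the values

# Self-attention: queries, keys, and values are all derived from the same feature vectors
features = torch.randn(8, 64)                     # 8 feature vectors of length 64
out = scaled_dot_product_attention(features, features, features)   # -> shape [8, 64]
```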

Fig. 5

The architecture of TransUNet. The transformer layer represented by the yellow box shows the application of multi-head self-attention (MSA). MLP represents the multilayer perceptron. In general, the feature vectors in the bottleneck of the U-Net are set as the input to the stack of n transformer layers. As these layers do not change the dimension of the features, they are easy to implement and do not affect other parts of the U-Net model. Replicated from [15] (CC BY 4.0)

2.1.3 Loss Functions for Segmentation Tasks

This section summarizes some of the most widely used loss functions for medical image segmentation (Fig. 6) and describes their usage in different scenarios. A complementary reading material for an extensive list of loss functions can be found in [16, 17]. In the following, the predicted probability by the segmentation model and the ground truth at the ith pixel/voxel are denoted as pi and gi, respectively. N is the number of voxels in the image.

Fig. 6

Loss functions for medical image segmentation. WCE: weighted cross-entropy loss. DPCE: distance map penalized cross-entropy loss. ELL: exponential logarithmic loss. SS: sensitivity-specificity loss. GD: generalized Dice loss. pGD: penalty loss. Asym: asymmetric similarity loss. IoU: intersection over union loss. HD: Hausdorff distance loss. Ⓒ2021 Elsevier. Reprinted, with permission, from [16]

Cross-Entropy Loss

Cross-entropy (CE) is defined as a measure of the difference between two probability distributions for a given random variable or set of events. This loss function is used for pixel-wise classification in segmentation tasks:

$$ {\ell}_{CE}=-\sum \limits_i^N\sum \limits_k^K{y}_i^k\log \left({p}_i^k\right) $$
(8)

where N is the number of voxels, K is the number of classes, \( {y}_i^k \) is a binary indicator that shows whether k is the correct class, and \( {p}_i^k \) is the predicted probability for voxel i to be in kth class.

Weighted Cross-Entropy Loss

Weighted cross-entropy (WCE) loss is a variant of the cross-entropy loss to address the class imbalance issue. Specifically, class-specific coefficients are used to weigh each class differently, as follows:

$$ {\ell}_{WCE}=-\sum \limits_i^N\sum \limits_k^K{w}_{y_k}{y}_i^k\log \left({p}_i^k\right) $$
(9)

Here, \( {w}_{y_k} \) is the coefficient for the kth class. Suppose there are 5 positive samples and 12 negative samples in a binary classification training set. By setting w0 = 1 and w1 = 2, the loss would be as if there were ten positive samples.
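
In PyTorch, for instance, such class-specific coefficients can be passed directly to the built-in cross-entropy loss; the weights below mirror the example above and are otherwise arbitrary.

```python
import torch
import torch.nn as nn

# Binary segmentation with class weights w0 = 1 (background) and w1 = 2 (foreground)
criterion = nn.CrossEntropyLoss(weight=torch.tensor([1.0, 2.0]))

logits = torch.randn(4, 2, 64, 64)           # [batch, classes, H, W] raw network outputs
target = torch.randint(0, 2, (4, 64, 64))    # integer class label for each pixel
loss = criterion(logits, target)
```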

Focal Loss

Focal loss was proposed to apply a modulating term to the CE loss to focus on hard negative samples. It is a dynamically scaled CE loss, where the scaling factor decays to zero as confidence in the correct class increases. Intuitively, this scaling factor can automatically down-weight the contribution of easy examples during training and rapidly focus the model on hard examples:

$$ {\ell}_{Focal}=-\sum \limits_i^N{\alpha}_i{\left(1-{p}_i\right)}^{\gamma}\log \left({p}_i\right) $$
(10)

Here, αi is the weighting factor to address the class imbalance and γ is a tunable focusing parameter (γ > 0).

Dice Loss

The Dice coefficient is a widely used metric in the computer vision community to calculate the similarity between two binary segmentations. In 2016, this metric was adapted as a loss function for 3D medical image segmentation [2]:

$$ {\ell}_{Dice}=1-\frac{2{\sum}_i^N{p}_i{g}_i+1}{\sum_i^N\left({p}_i+{g}_i\right)+1} $$
(11)

Generalized Dice Loss

Generalized Dice loss (GDL) [18] was proposed to reduce the well-known correlation between region size and Dice score:

$$ {\ell}_{GDL}=1-2\frac{\sum_{l=1}^2{w}_l{\sum}_i^N{p}_{li}{g}_{li}}{\sum_{l=1}^2{w}_l{\sum}_i^N\left({p}_{li}+{g}_{li}\right)} $$
(12)

Here \( {w}_l=\frac{1}{{\left({\sum}_i^N{g}_{li}\right)}^2} \) is used to provide invariance to different region sizes, i.e., the contribution of each region is corrected by the inverse of its volume.

Tversky Loss

The Tversky loss [19] is a generalization of the Dice loss by adding two weighting factors α and β to the FP (false positive) and FN (false negative) terms. The Tversky loss is defined as

$$ {\ell}_{Tversky}=1-\frac{\sum_i^N{p}_i{g}_i}{\sum_i^N\left[{p}_i{g}_i+\alpha \left(1-{g}_i\right){p}_i+\beta \left(1-{p}_i\right){g}_i\right]} $$
(13)

Recently, a comprehensive study [16] of loss functions on medical image segmentation tasks showed that using Dice-related compound loss functions, e.g., Dice loss + CE loss, is a better choice for new segmentation tasks, though none of the losses can consistently achieve the best performance across multiple segmentation tasks. Therefore, for a new segmentation task, we recommend that readers start with the Dice + CE loss, which is also the default loss function in one of the most popular medical image segmentation frameworks, nnU-Net [6].
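
As an illustration of such a compound loss, the sketch below combines a soft Dice term (Eq. 11, with a smoothing constant of 1) and cross-entropy for binary segmentation. It is a minimal example, not the nnU-Net implementation.

```python
import torch
import torch.nn.functional as F

def dice_ce_loss(logits, target, smooth=1.0):
    """Compound Dice + cross-entropy loss for binary segmentation.
    logits: [B, 2, ...] raw outputs; target: [B, ...] integer labels in {0, 1}."""
    ce = F.cross_entropy(logits, target)
    p = torch.softmax(logits, dim=1)[:, 1]        # predicted foreground probabilities p_i
    g = target.float()                            # ground truth g_i
    intersection = (p * g).sum()
    dice = 1 - (2 * intersection + smooth) / (p.sum() + g.sum() + smooth)   # Eq. 11
    return ce + dice

loss = dice_ce_loss(torch.randn(2, 2, 32, 32), torch.randint(0, 2, (2, 32, 32)))
```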

Finally, note that other loss functions have also been proposed to introduce prior knowledge about size, topology, or shape, for instance [20].

2.1.4 Early Stopping

Given a loss function, a simple strategy for training is to stop the training process once a predetermined maximum number of iterations is reached. However, too few iterations lead to under-fitting, while over-fitting may occur with too many iterations. “Early stopping” is a method to avoid such issues. When using early stopping, the training set is split into training and validation sets, and the stopping condition is based on the performance on the validation set. For example, if the validation performance (e.g., average Dice score) does not increase for a given number of iterations, the early stopping condition is triggered. In this situation, the best model with the highest performance on the validation set is saved and used for inference. Of course, one should not report the validation performance as the final evaluation of the model. Instead, one should use a separate test set, kept unseen during training, for an unbiased evaluation.
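
A schematic training loop with patience-based early stopping is sketched below; `train_one_epoch` and `evaluate_dice` are hypothetical placeholders for the user's own training and validation routines.

```python
import copy

def fit(model, train_loader, val_loader, max_epochs=200, patience=20):
    """Stop training when the validation Dice has not improved for `patience` consecutive epochs."""
    best_dice = 0.0
    best_weights = copy.deepcopy(model.state_dict())
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_one_epoch(model, train_loader)            # hypothetical: one pass over the training set
        val_dice = evaluate_dice(model, val_loader)     # hypothetical: average Dice on the validation set
        if val_dice > best_dice:
            best_dice = val_dice
            best_weights = copy.deepcopy(model.state_dict())   # keep the best model so far
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                                   # early stopping condition triggered
    model.load_state_dict(best_weights)                 # restore the best model for inference
    return model
```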

2.1.5 Evaluation Metrics for Segmentation Tasks

Various metrics can quantitatively evaluate different aspects of a segmentation algorithm. In a binary segmentation task, a true positive (TP) indicates that a pixel in the target object is correctly predicted as target. Similarly, a true negative (TN) represents a background pixel that is correctly identified as background. On the other hand, a false positive (FP) and a false negative (FN) refer to a wrong prediction for pixels in the target and background, respectively. Most of the evaluation metrics are based upon the number of pixels in these four categories.

Sensitivity measures the completeness of positive predictions with regard to the positive ground truth (TP +  FN). It thus shows the model’s ability to identify target pixels. It is also referred to as recall or true-positive rate (TPR). It is defined as

$$ \mathrm{Sensitivity}=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}} $$
(14)

As the negative counterpart of sensitivity, specificity describes the proportion of negative pixels that are correctly predicted. It is also referred to as true-negative rate (TNR). It is defined as

$$ \mathrm{Specificity}=\frac{\mathrm{TN}}{\mathrm{TN}+\mathrm{FP}} $$
(15)

Specificity can be difficult to interpret because TN is usually very large. It can even be misleading as TN can be made arbitrarily large by changing the field of view. This is due to the fact that the metric is computed over pixels and not over patients/controls like in classification tasks (the number of controls is fixed). In order to provide meaningful measures of specificity, it is preferable to define a background region that has an anatomical definition (for instance, the brain mask from which the target is subtracted) and does not include the full field of view of the image.

Positive predictive value (PPV), also known as precision, measures the correct rate among pixels that are predicted as positives:

$$ \mathrm{PPV}=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}} $$
(16)

For clinical interpretation of segmentation, it is often useful to have a more direct estimation of the proportion of false positives among the positive predictions. To that purpose, one can report the false discovery rate:

$$ \mathrm{FDR}=1-\mathrm{PPV}=\frac{\mathrm{FP}}{\mathrm{TP}+\mathrm{FP}} $$
(17)

which is redundant with PPV but may be more intuitive for clinicians in the context of segmentation.

Dice similarity coefficient (DSC) measures the proportion of spatial overlap between the ground truth (TP+FN) and the predicted positives (TP+FP). Dice similarity is the same as the F1 score, which computes the harmonic mean of sensitivity and PPV:

$$ \mathrm{DSC}=\frac{2\mathrm{TP}}{2\mathrm{TP}+\mathrm{FN}+\mathrm{FP}} $$
(18)

Accuracy is the ratio of correct predictions:

$$ \mathrm{Accuracy}=\frac{\mathrm{TP}+\mathrm{TN}}{\mathrm{TP}+\mathrm{TN}+\mathrm{FP}+\mathrm{FN}} $$
(19)

As was the case for specificity, we note that there are many segmentation tasks where the target anatomical structure is very small (e.g., subcortical structures); hence, the foreground and background have unbalanced numbers of pixels. In this case, accuracy can be misleading and display high values for poor segmentations. Moreover, as for specificity, one needs to define a background region in order for TN, and thus accuracy, not to vary arbitrarily with the field of view.

The Jaccard index (JI), also known as the intersection over union (IoU), measures the percentage of overlap between the ground truth and positive prediction relative to the union of the two:

$$ \mathrm{JI}=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}+\mathrm{FN}} $$
(20)

JI is closely related to the DSC. However, it is always lower than the DSC and tends to penalize poor segmentations more severely.
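
The overlap-based metrics above can all be derived from the four pixel counts. A minimal NumPy sketch, assuming two binary masks restricted to a meaningful background region (e.g., the brain mask), is given below.

```python
import numpy as np

def overlap_metrics(pred, gt):
    """Voxel-wise overlap metrics computed from two binary masks of the same shape."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    tn = np.logical_and(~pred, ~gt).sum()
    return {
        "sensitivity": tp / (tp + fn),                 # Eq. 14
        "specificity": tn / (tn + fp),                 # Eq. 15 (depends on the chosen background region)
        "ppv":         tp / (tp + fp),                 # Eq. 16
        "dsc":         2 * tp / (2 * tp + fn + fp),    # Eq. 18
        "jaccard":     tp / (tp + fp + fn),            # Eq. 20
    }

metrics = overlap_metrics(np.random.rand(32, 32, 32) > 0.5, np.random.rand(32, 32, 32) > 0.5)
```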

There are also distance measures of segmentation accuracy which are especially relevant when the accuracy of the boundary is critical. These include the average symmetric surface distance (ASSD) and the Hausdorff distance (HD). Suppose the surface of the ground truth and the predicted segmentation are \( \mathcal{S} \) and \( {\mathcal{S}}^{\prime } \), respectively. For any point \( \boldsymbol{p}\in \mathcal{S} \), the distance from p to surface \( {\mathcal{S}}^{\prime } \) is defined by the minimum Euclidean distance:

$$ d\left(\boldsymbol{p},{\mathcal{S}}^{\prime}\right)=\underset{{\boldsymbol{p}}^{\prime}\in {\mathcal{S}}^{\prime }}{\min }{\left\Vert \boldsymbol{p}-{\boldsymbol{p}}^{\prime}\right\Vert}_2 $$
(21)

Then the average distance between \( \mathcal{S} \) and \( {\mathcal{S}}^{\prime } \) is given by averaging over \( \mathcal{S} \):

$$ d\left(\mathcal{S},{\mathcal{S}}^{\prime}\right)=\frac{1}{N_S}\sum \limits_{i=1}^{N_S}d\left({\boldsymbol{p}}_i,{\mathcal{S}}^{\prime}\right) $$
(22)

Note that \( d\left(\mathcal{S},{\mathcal{S}}^{\prime}\right)\ne d\left({\mathcal{S}}^{\prime },\mathcal{S}\right) \). Therefore, both directions are included in ASSD so that the mean of the surface distance is symmetric:

$$ \mathrm{ASSD}=\frac{1}{N_S+{N}_{S\prime }}\left[\sum \limits_{i=1}^{N_S}d\left({\boldsymbol{p}}_i,{\mathcal{S}}^{\prime}\right)+\sum \limits_{j=1}^{N_{S\prime }}d\Big({\boldsymbol{p}}_j^{\prime },\mathcal{S}\Big)\right] $$
(23)

The ASSD tends to obscure localized errors when the segmentation is decent at most points on the boundary. The Hausdorff distance (HD) can better capture such errors by computing the maximum, rather than the average, distance to a surface. To that purpose, one defines

$$ h\left(\mathcal{S},{\mathcal{S}}^{\prime}\right)=\underset{\boldsymbol{p}\in \mathcal{S}}{\max }d\left(\boldsymbol{p},{\mathcal{S}}^{\prime}\right) $$
(24)

Note that, again, \( h\left(\mathcal{S},{\mathcal{S}}^{\prime}\right)\ne h\left({\mathcal{S}}^{\prime },\mathcal{S}\right) \). Therefore, both directions are included in HD so that the distance is symmetric:

$$ \mathrm{HD}=\max \left(h\left(\mathcal{S},{\mathcal{S}}^{\prime}\right),h\left({\mathcal{S}}^{\prime },\mathcal{S}\right)\right) $$
(25)

HD is more sensitive than ASSD to localized errors. However, it can be too sensitive to outliers. Hence, using the 95th percentile rather than the maximum value for computing \( h\left(\mathcal{S},{\mathcal{S}}^{\prime}\right) \) is a good option to alleviate the problem.
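
Assuming the two surfaces are available as point sets (e.g., coordinates of boundary voxels), the ASSD, HD, and its 95th percentile variant can be computed from the pairwise distance matrix, as in the sketch below.

```python
import numpy as np
from scipy.spatial.distance import cdist

def surface_distances(surface_a, surface_b, percentile=95):
    """Surface distance metrics between two surfaces given as [N, 3] arrays of point coordinates."""
    d = cdist(surface_a, surface_b)                # pairwise Euclidean distances
    d_ab = d.min(axis=1)                           # d(p, S') for every p in S   (Eq. 21)
    d_ba = d.min(axis=0)                           # d(p', S) for every p' in S'
    assd = (d_ab.sum() + d_ba.sum()) / (len(d_ab) + len(d_ba))                     # Eq. 23
    hd = max(d_ab.max(), d_ba.max())                                               # Eq. 25
    hd_p = max(np.percentile(d_ab, percentile), np.percentile(d_ba, percentile))   # e.g., HD95
    return assd, hd, hd_p

assd, hd, hd95 = surface_distances(np.random.rand(100, 3), np.random.rand(120, 3))
```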

Moreover, there are some volume-based measurements that focus on correctly estimating the volume of the target structure, which is essential for clinicians since the size of the tissue is an important marker in many diseases. Denote the ground truth volume as V and the predicted volume as V′. There are a few expressions for the volume difference: (1) the unsigned volume difference, |V′− V |; (2) the normalized unsigned difference, \( \frac{\mid {V}^{\prime }-V\mid }{V} \); (3) the normalized signed difference, \( \frac{V^{\prime }-V}{V} \); and (4) Pearson’s correlation coefficient between the ground truth volumes and the predicted volumes, \( \frac{\mathrm{Cov}\left(V,{V}^{\prime}\right)}{\sqrt{\mathrm{Var}(V)}\sqrt{\mathrm{Var}\left({V}^{\prime}\right)}} \). Nevertheless, note that, while they are useful, these volume-based metrics can also be misleading when used in isolation (a segmentation could be wrongly placed while providing a reasonable volume estimate). They thus need to be combined with overlap metrics such as Dice.

Finally, some recent guidelines on validation of different image analysis tasks, including segmentation, were published in [21].

2.1.6 Pre-processing for Segmentation Tasks

Image pre-processing is a set of sequential steps taken to improve the data and prepare it for subsequent analysis. Appropriate image pre-processing steps often significantly improve the quality of feature extraction and the downstream image analysis. For deep learning methods, they can also help the training process converge faster and achieve better model performance. The following sections will discuss some of the most widely used image pre-processing techniques.

Skull Stripping

Many neuroimaging applications require a preliminary processing step to isolate the brain from extracranial or non-brain tissues in MRI scans, commonly referred to as skull stripping. Skull stripping helps reduce the variability in datasets and is a critical step prior to many other image processing algorithms such as registration, segmentation, or cortical surface reconstruction. In the literature, skull stripping methods are broadly classified into five categories: mathematical morphology-based methods [22], intensity-based methods [23], deformable surface-based methods [24], atlas-based methods [25], and hybrid methods [26]. Recently, deep learning-based skull stripping methods have been proposed [27,28,29,30,31,32] to improve accuracy and efficiency. A detailed discussion of the merits and limitations of various skull stripping techniques can be found in [33].

Bias Field Correction

The bias field refers to a low-frequency and very smooth signal that corrupts MR images [34]. These artifacts, often described as shading or bias, can be generated by imperfections in the field coils or by magnetic susceptibility changes at the boundaries between anatomical tissue and air. This bias field can significantly degrade the performance of image processing algorithms that use the image intensity values. Therefore, a pre-processing step is usually required to remove the bias field. The N4 bias field correction algorithm [35] is one of the most widely used methods for this purpose, as it assumes a simple parametric model and does not require tissue classification.

Data Harmonization

Another challenge of MRI data is that it suffers from significant intensity variability due to several factors such as variations in hardware, reconstruction algorithms, and acquisition settings. This is also due to the fact that most MR imaging sequences (e.g., T1-weighted, T2-weighted) are not quantitative (the voxel values can only be interpreted relative to each other). Such differences can often be pronounced in multisite studies, among others. This variability can be problematic because intensity-based models may not generalize well to such heterogeneous datasets, and any resulting analysis can suffer from significant biases caused by acquisition details rather than anatomical differences. It is thus desirable to have robust data harmonization methods to reduce unwanted variability across sites, scanners, and acquisition protocols. One popular MRI harmonization method is ComBat, a statistical approach originally developed to remove batch effects in genomic data. This method was shown to exhibit a good capacity to remove unwanted site biases while preserving the desired biological information [36]. Another popular method is a deep learning-based image-to-image translation model, CycleGAN [37]. The CycleGAN and its variants do not require paired data, and thus the training process is unsupervised in the context of data harmonization.

Intensity Normalization

Intensity normalization is another important step to ensure comparability across images. In this section, we discuss common intensity normalization techniques. Readers can refer to the work [38] in which the author explores the impact of different intensity normalization techniques on MR image synthesis.

Z-Score Normalization

The basic Z-score normalization on the entire image is also called the whole-brain normalization. Given the mean μ and standard deviation σ from all voxels in a brain mask B, Z-score normalization can be performed for all voxels in image I as follows:

$$ {I}_{z- score}(x)=\frac{I(x)-\mu }{\sigma } $$
(26)

While straightforward to implement, whole-brain normalization is known to be sensitive to outliers.
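
A minimal NumPy sketch of Eq. 26 is shown below; it assumes the image and a binary brain mask are already available as arrays.

```python
import numpy as np

def zscore_normalize(image, brain_mask):
    """Whole-brain Z-score normalization (Eq. 26): statistics are computed inside the brain mask only."""
    voxels = image[brain_mask > 0]
    mu, sigma = voxels.mean(), voxels.std()
    return (image - mu) / sigma

normalized = zscore_normalize(np.random.rand(64, 64, 64) * 1000.0, np.ones((64, 64, 64)))
```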

White Stripe Normalization

White stripe normalization [39] is based on the parameters obtained from a sample of normal-appearing white matter (NAWM) and is thus robust to local intensity outliers such as lesions. The NAWM is obtained by smoothing the histogram of the image I and selecting the mode of the distribution. For T1-weighted MRI, the “white stripe” is defined as the 10% of intensity values around the mean of NAWM μ. Let F(x) be the CDF of the specific MR image I(x) inside the brain mask B, and τ = 5%. The white stripe Ωτ is defined as

$$ {\Omega}_{\tau }=\left\{I(x)\;|\;{F}^{-1}\left(F\left(\mu \right)-\tau \right)<I(x)<{F}^{-1}\left(F\left(\mu \right)+\tau \right)\right\} $$
(27)

Then let στ be the sample standard deviation associated with Ωτ. The white stripe normalized image is

$$ {I}_{ws}(x)=\frac{I(x)-\mu }{\sigma_{\tau }} $$
(28)

Compared to whole-brain normalization, white stripe normalization may work better and be easier to interpret, especially for applications where intensity outliers such as lesions are expected.

Segmentation-Based Normalization

Segmentation-based normalization uses a segmentation of a specified tissue, such as the cerebrospinal fluid (CSF), gray matter (GM), or white matter (WM), to normalize the entire image to the mean of the tissue. Let T ⊂ B be the tissue mask for image I. The tissue mean can be calculated as \( \mu =\frac{1}{\mid T\mid }{\sum}_{t\in T}I(t) \) and the segmentation-based normalized image is expressed as

$$ {I}_{seg}(x)=\frac{cI(x)}{\mu } $$
(29)

where \( c\in {\mathbb{R}}^{+} \) is a constant.

Kernel Density Estimate Normalization

Kernel density estimate (KDE) normalization estimates the empirical probability density function of the intensities of the entire image I over the brain mask B via kernel density estimation. The KDE of the probability density function for the image intensities can be expressed as

$$ \hat{p}(x)=\frac{1}{\mathrm{HWD}\times \delta}\sum \limits_{i=1}^{\mathrm{HWD}}K\left(\frac{x-{x}_i}{\delta}\right) $$
(30)

where H, W, and D are the height, width, and depth of I, x is an intensity value, xi are the observed voxel intensities, K is the kernel, and δ is the bandwidth parameter that scales the kernel. With KDE normalization, the WM mode can be selected more robustly from this smooth version of the histogram, making it well suited for use within a segmentation-based normalization method.

Spatial Normalization

Spatial normalization aims to register a subject’s brain image to a common space (reference space) to allow comparisons across subjects. When the reference space is a standard space, such as the Montreal Neurological Institute (MNI) space [40] or the Talairach and Tournoux atlas (Talairach space), the registration also facilitates the sharing and interpretation of data across studies. It is also common practice to define a customized space from a dataset rather than using a standard space. For deep learning methods, it has been shown that training data with appropriate spatial normalization tend to yield better performances [41,42,43]. Rigid, affine, or deformable registration may be desirable for spatial normalization, depending on the application. Many registration methods are publicly available through software packages such as 3D Slicer, FreeSurfer [https://surfer.nmr.mgh.harvard.edu/], FMRIB Software Library (FSL) [https://fsl.fmrib.ox.ac.uk/fsl/fslwiki], and Advanced Normalization Tools (ANTs) [https://picsl.upenn.edu/software/ants/].

2.2 Supervision Settings

In the following three sections, we categorize the learning-based segmentation algorithms by their supervision setting. In decreasing order of the amount of annotation required, these are supervised, semi-supervised, and unsupervised methods (Fig. 7). For supervised methods, we mainly present training strategies and model architectures that help improve segmentation performance. For the other two types of approaches, we classify the mainstream ideas and then provide application examples proposed in recent research.

Fig. 7

Overview of the supervision settings for medical image segmentation. Best viewed in color

2.3 Supervised Methods

2.3.1 Background

In supervised learning, a model is presented with a given dataset \( \mathcal{D}={\left\{\left({x}^{(i)},{y}^{(i)}\right)\right\}}_{i=1}^n \) of inputs x and associated labels y. This y can take several forms, depending on the learning task. In particular, for fully convolutional neural network-based segmentation applications, y is a segmentation map. In supervised learning, the model learns from labeled training data by minimizing the loss function and applies what it has learned to make a prediction/segmentation on test data. Supervised training thus aims to find model parameters θ that best predict the data based on a loss function \( L\left(y,\hat{y}\right) \). Here, \( \hat{y} \) denotes the output of the model obtained by feeding a data point x to the function f(x;θ) that represents the model. Given sufficient training data, supervised methods generally perform better than semi-supervised or unsupervised segmentation methods.

2.3.2 Data Representation

Data is an important part of supervised segmentation models, and the model performance relies on data representation. In addition to image pre-processing (Subheading 2.1.6), there are a few key steps for data preparation before being fed into the segmentation network.

Patch Formulation

The inputs of a CNN can be image patches rather than the whole image when the latter is too large and would require too much GPU memory. The image patches could be 2D slices, 3D patches, or any format in between. The choice of patch format affects the performance of networks for a given dataset and task [44]. Compared to 3D patches, 2D slices have the advantage of a lighter computational load during training. However, contextual information along the third axis is missing. In contrast, 3D patches leverage data from all three axes, but they require more computational resources. As a compromise between 2D and 3D patches, “2.5D” approaches have been proposed, taking 2D slices in all three orthogonal views through the same voxel [45]. Those 2D slices could be trained in a single CNN or in a separate CNN for each view. Furthermore, Zhang et al. [46] proposed 2.5D stacked slices to leverage the information from adjacent slices in each view.

Patch Extraction

Due to the imbalance between foreground and background, various patch extraction strategies have been designed to obtain robust segmentation. Kamnitsas et al. [47], Dolz et al. [48], and Li et al. [49] pick a voxel within the foreground or background with 50% probability at every iteration during training and select the patch centered at that voxel. In [46], Zhang et al. extract 2.5D stacked patches if the central slice contains the foreground, even if only by one voxel. In some models [50, 51], 3D patches containing the target structure are used as input instead of the whole image, which can reduce the effect of the background when segmenting target structures with small volumes.

Data Augmentation

To avoid over-fitting and increase the generalizability of the model, data augmentation (DA) is widely used in medical image segmentation [52]. Common DA strategies can be classified into three categories: (1) spatial augmentation, (2) image appearance augmentation, and (3) image quality augmentation. For spatial augmentation, random image flip, rotation, scaling, and deformation are often used [4, 45, 53,54,55]. Random gamma correction, intensity scaling, and intensity shift are common forms of image appearance augmentation [51, 54, 56, 57]. Image quality augmentation includes random Gaussian blur, random noise addition, and image sharpening [51, 56]. Note that while we only list a few commonly used methods here, many others have been explored. TorchIO [58] is a widely used software package for data augmentation.
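
As an example, a typical pipeline covering the three categories could be composed with TorchIO roughly as follows; the chosen transforms and parameter ranges are illustrative only and should be adapted to the data at hand.

```python
import torch
import torchio as tio

# Illustrative augmentation pipeline: spatial, appearance, and quality transforms
augment = tio.Compose([
    tio.RandomFlip(axes=(0, 1, 2)),                    # spatial: random flips along each axis
    tio.RandomAffine(scales=(0.9, 1.1), degrees=10),   # spatial: random scaling and rotation
    tio.RandomGamma(log_gamma=(-0.3, 0.3)),            # appearance: random gamma correction
    tio.RandomNoise(std=(0, 0.05)),                    # quality: additive Gaussian noise
    tio.RandomBlur(std=(0, 1)),                        # quality: random Gaussian blur
])

subject = tio.Subject(image=tio.ScalarImage(tensor=torch.rand(1, 64, 64, 64)))
augmented = augment(subject)                           # a new augmented subject at every call
```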

2.3.3 Network Architecture

Here, we classify the popular supervised segmentation networks into single/multipath networks and encoder-decoder networks.

Single/Multipath Networks

As discussed above, patches are often used as input instead of the entire image, resulting in a lack of global context. This could produce noisy segmentations, such as undesired islands of false-positive voxels that need to be removed in post-processing [48]. To compensate for the missing global context, Li et al. [49] used spatial coordinates as additional channels of input patches. A multipath network is another feasible solution (Fig. 8). Multipath networks usually contain global and local paths [47, 59, 60] that extract features at different scales. The global path uses convolutions with a larger kernel size [60] or a larger receptive field [47] to learn global information. In contrast, local features are extracted in the local path. The global path thus extracts global features and tends to locate the position of the target structure, whereas the shape, size, texture, boundary, and other details of the target structure are identified by the local path. However, the performance of this type of network is easily affected by the size and design of the input patches: patches that are too small do not provide enough information, while patches that are too large are computationally prohibitive.

Fig. 8

Examples of single-path (top) and multipath (bottom) networks. In the multipath network, the inputs for the two pathways are centered at the same location. The top pathway is equivalent to the single-path network and takes the normal resolution image as input, while the bottom pathway takes a downsampled image with larger field of view as input. Replicated from [47] (CC BY 4.0)

U-Net and Its Variants

To tackle the limitations of the single/multipath networks, many models use U-Net variants with encoder-decoder paths [1, 61], which establish end-to-end training from image to segmentation map. The encoder is similar to the single/multipath networks but with downsampling operations between the different scales of feature maps. The decoder leverages the extracted features from the encoder and produces a segmentation of the same size as the original image. Skip connections that pass the feature maps from the encoder directly to the decoder contribute to the performance of the U-Net: the passed information helps recover the details of the segmentation.

The most common modification of the U-Net is the introduction of other convolutional modules, such as residual blocks [62], dense blocks [63], attention modules [3, 4], etc. These convolutional modules can replace regular convolution operations or be used in the skip connections of the U-Net. Residual blocks mitigate the vanishing gradient problem during training by adding the input of the module to its output, which also improves the speed of convergence [62]. In this configuration, the network can be built deeper. The works of [53, 59, 64,65,66] used residual connections or residual blocks instead of regular convolutions in their network architectures for robust segmentation of various brain structures. Dense blocks can strengthen feature propagation and encourage feature reuse to improve segmentation accuracy, but they require more computational resources during training. Zhang et al. [46, 56] employed the Tiramisu network [67], a densely connected U-shaped network, to produce superior multiple sclerosis (MS) lesion segmentation.

The attention module is another commonly used tool in segmentation to focus on salient features [4]. It can be categorized into spatial attention and channel attention modules. Li et al. [53] use spatial attention modules in the skip connections for extracting smaller subcortical structures. Similarly, attention modules are used between skip connections and in the decoder part in the work of [51, 68] for segmenting vestibular schwannoma and cochlea. In addition, Zhang et al. [69] proposed to use slice-wise attention networks in 3D CNNs for MS segmentation. Applying the slice-wise attention in three different orientations improves the computational efficiency compared to the regular attention module. Hou et al. [70] proposed the cross-attention block, which combines channel attention and spatial attention. Moreover, in [71], a skip attention unit is used for brain tumor segmentation. Zhou et al. [72] build fusion blocks based on the attention module. Attention modules have also been used for brain tumor segmentation [73].

Transformers

As discussed in Subheading 2.1.2, transformers have become popular in medical image segmentation [74,75,76]. Transformers leverage long-range dependencies and can better capture low-level details. In practice, they can replace CNNs [77], be combined with CNNs [78, 79], or be integrated into CNNs [80]. Some recent works [14, 15, 77] have shown that implementing transformers within a U-Net architecture can achieve superior performance in medical image segmentation compared to CNN-only counterparts.

2.3.4 Framework Configuration

A single network mainly focuses on one task during training and may ignore other potentially useful information. To improve segmentation accuracy, frameworks with multiple encoders and decoders have been proposed [53, 81, 82].

Multi-task Networks

As the name suggests, multi-task networks attempt to simultaneously tackle a main task as well as auxiliary tasks, rather than focusing on a single segmentation task. These networks usually contain a shared encoder and multiple decoders for the multiple tasks, which can also help deal with class imbalance (Fig. 9). Compared to a single-task network, the learning ability of the shared encoder is increased by training on multiple tasks from the same domain (e.g., the tasks of the multiple decoders), which can improve segmentation performance. Simultaneously learning multiple tasks can also improve model generalizability. McKinley et al. [81] leverage the information of additional tissue types to increase the accuracy of MS lesion segmentation. Another common multi-task setting is to introduce an auxiliary reconstruction task [57].

Fig. 9

Example of multi-task framework. The model takes four 3D MRI sequences (T1w, T1c, T2w, and FLAIR) as input. The U-Net structure (the top pathway with skip connection) serves as the segmentation network, and the output contains the segmentation maps of the three subregions (whole tumor (WT), tumor core (TC), and enhancing tumor (ET)). An auxiliary VAE branch (the bottom decoder) that reconstructs the input images is applied in the training stage to regularize the shared encoder. Ⓒ2019 Springer Nature. Reprinted, with permission, from [57]

Cascaded Networks

A cascaded network is a series of connected networks such that the input of each downstream network is the output from an upstream network (Fig. 10). For example, a coarse-to-fine segmentation strategy can be used to reduce the high computational cost of training for 3D images [50, 53]. In this scenario, an upstream network could take downsampled images as input to roughly locate the target structures, allowing the images to be cropped to the region of interest for the downstream network. The downstream network could then produce high-quality segmentation in full resolution. Another advantage of this approach is to reduce the impact of volume imbalance between foreground and background classes. However, the upstream network would determine the performance of the whole framework, and some global information is missing in the downstream networks.

Fig. 10

Example of cascaded networks. WNet segments the whole tumor from the input multimodal 3D MRI. Then based upon the segmentation, a bounding box (yellow dash line) can be obtained and used to crop the input. The TNet takes the cropped image to segment the tumor core. Similarly, the ENet segments the enhancing tumor core by taking the cropped images determined by the segmentation from the previous stage. Ⓒ2018 Springer Nature. Reprinted, with permission, from [50]

Ensemble Networks

To obtain a robust segmentation, a popular approach is to aggregate the outputs from multiple independent networks (i.e., with no weights/parameters shared). Kamnitsas et al. proposed the ensemble of multiple models and architectures (EMMA) [83] for brain tumor segmentation. Kao et al. [84] produce segmentations using an ensemble of 26 neural networks. Zhao et al. [85] proposed a framework for 3D segmentation with multiple 2D networks that take input from different views. Huo et al. [82] proposed the spatially localized atlas network tiles (SLANT) method to distribute multiple networks for 3D high-resolution whole-brain segmentation. Among its variants, SLANT-27 (Fig. 11), which ensembles 27 networks, produces the best result. Last but not least, many medical image segmentation challenge participants use model ensembling to achieve high performance.

Fig. 11

SLANT-27: An example of ensemble networks. The whole brain is split into 27 overlapping subspaces with regard to their spatial locations (yellow cube). For each location, there is an independent 3D fully convolutional network (FCN) for segmentation (blue cube). The ensemble is achieved by label fusion on overlapping locations. Ⓒ2019 Elsevier. Reprinted, with permission, from [82]

2.3.5 Multiple Modalities and Timepoints

Many neuroimaging studies contain multiple modalities or multiple timepoints per subject. This additional information is clearly valuable and can be leveraged to improve segmentation performance.

Multiple Modalities

Different imaging modalities offer different visualizations of various tissue types. Multi-modality datasets can thus be leveraged to improve segmentation accuracy. For example, Zhang et al. [86] proposed a framework with two independent networks that take two different modalities as inputs. Instead of combining single-modality networks, Zhang et al. [46] concatenate multi-modality data as different input channels. However, not all modalities are always available in clinical practice: (1) the MRI sequences can vary between imaging sites and (2) some modalities may be unusable due to poor image quality. This is known as the missing modality problem. To tackle this problem, Havaei et al. [87] proposed a deep learning method that is robust to missing modalities for brain tumor and MS segmentation, which contains an abstraction layer that transforms feature maps into statistics to help learning during training. In [88], the authors further improved modality dropout by introducing dynamic filters and a co-training strategy for MS lesion segmentation. In [89, 90], the authors used a knowledge distillation scheme to transfer knowledge from full-modality data to each missing-modality condition with individual models.

Multiple Timepoints

Data from multiple timepoints are important for tracking longitudinal changes in a single subject. The additional timepoints can also be used as temporal context to improve the segmentation at each timepoint. In [45], longitudinal data are concatenated as a multichannel input to improve segmentation. In [91], stacked convolutional long short-term memory (C-LSTM) modules are integrated into a CNN for 4D medical image segmentation, which allows the model to learn correlations and overall trends from longitudinal data. Li et al. [92] also proposed a framework with C-LSTM modules for jointly segmenting longitudinal data.

2.4 Semi-supervised Methods

2.4.1 Background

Given a considerable amount of labeled data, deep learning-based methods have achieved state-of-the-art performances in various medical image analysis applications. However, it is a laborious and time-consuming process to obtain dense pixel/voxel-level annotations for segmentation tasks. Since accurate annotations require expertise in the medical domain, they are also expensive to collect. It is therefore desirable to leverage unlabeled data alongside the labeled data to improve model performance, an approach typically known as semi-supervised learning (SSL). Intuitively, these unlabeled data can provide critical information on the data distribution and thus can be used to improve model robustness by exploring this distribution.

Conceptually, SSL falls in between supervised learning (fully labeled data) and unsupervised learning (no labeled data). In SSL, we have access to both a labeled dataset \( {\mathcal{D}}_L=\left\{\left({x}_l^{(i)},{y}_l^{(i)}\right)\;|\;i=1,2,\cdots, {n}_l\right\} \), where \( {y}_l^{(i)} \) is the ith manually annotated ground truth mask in the context of a segmentation task, and an unlabeled dataset \( {\mathcal{D}}_U=\left\{{x}_u^{(i)}\;|\;i=1,2,\cdots, {n}_u\right\} \). Typically, nu ≫ nl. The main objective of SSL is to train a segmentation network by leveraging both \( {\mathcal{D}}_L \) and \( {\mathcal{D}}_U \) to surpass the performance achieved by solely supervised learning with \( {\mathcal{D}}_L \) or unsupervised learning with \( {\mathcal{D}}_U \).

According to [93], there are mainly three underlying assumptions held by SSL: (1) smoothness assumption, (2) low-density assumption, and (3) cluster assumption. The smoothness assumption states that the data points that are close by in the input or latent space should have similar or identical labels. With this assumption, we can expect the labels of unlabeled data to be similar to those of labeled data when these samples are similar in input or latent space, i.e., the labels from the labeled dataset can be transferred to the unlabeled dataset. In the low-density assumption, we assume that the decision boundary of a classifier should ideally not pass through the high density of the marginal data distribution. Placing the decision boundary in a high-density region would violate the smoothness assumption because the labels would be more likely to be dissimilar for similar data points. Lastly, the cluster assumption states that each cluster of data points should belong to the same class. This assumption is necessary because if the data points from the unlabeled and labeled datasets cannot be meaningfully clustered, the unlabeled data cannot be used to improve the model performance trained from only the labeled data.

2.4.2 Overview of Semi-supervised Techniques

In the semi-supervised learning literature, most of the techniques are originally designed and validated in the context of classification tasks. However, these methods can be readily adapted to segmentation tasks since a segmentation task can be viewed as pixel-wise classification. In this chapter, we mainly categorize the SSL approaches into three techniques, namely, (1) consistency regularization, (2) entropy minimization, and (3) self-training. However, most existing SSL approaches often employ a combination of these techniques rather than a single one, as summarized in Table 1. In the following sections, we will discuss each approach in detail and introduce some of the most important SSL techniques alongside.

Table 1 Summary of classic semi-supervised learning methods

2.4.3 Consistency Regularization

In semi-supervised learning, consistency regularization has been widely used as a technique to make use of unlabeled data. The idea of consistency regularization is based on the smoothness assumption that the network outputs should remain the same even if the input data is perturbed slightly (i.e., do not vary dramatically in the input space). The consistency between the predictions of an unlabeled sample and its perturbed counterpart can be used as a supervision mechanism for training to leverage the unlabeled data. In such scenarios, we can formulate the semi-supervised training objective as follows:

$$ {\ell}_{SSL}=\sum \limits_{\left({x}_l,{y}_l\right)\in {\mathcal{D}}_L}{L}_S\left({x}_l,{y}_l\right)+\alpha \sum \limits_{{x}_u\in {\mathcal{D}}_U}{L}_C\left({x}_u,{\tilde{x}}_u\right) $$
(31)

where LS is the supervised loss for labeled data. For segmentation tasks, LS can be one of the segmentation losses we presented in Subheading 2.1.3. xu and \( {\tilde{x}}_u \) are the unlabeled data and its perturbed version, respectively. LC is the consistency loss function. Mean squared error loss and KL divergence loss have been widely used as LC in the SSL literature. α is a balancing term to weigh the impact of consistency loss from unlabeled data.
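
To make Eq. 31 concrete, the following PyTorch-style sketch shows one way the combined objective could be computed for a segmentation network. The names (model, x_l, y_l, x_u, x_u_perturbed) and the choices of cross-entropy for LS and mean squared error for LC are illustrative assumptions rather than a prescribed implementation.

```python
import torch
import torch.nn.functional as F

def ssl_objective(model, x_l, y_l, x_u, x_u_perturbed, alpha=0.1):
    """Hypothetical implementation of Eq. 31 for a segmentation network.

    x_l: labeled images (B, 1, H, W); y_l: integer masks (B, H, W)
    x_u, x_u_perturbed: an unlabeled batch and its perturbed version
    """
    # Supervised term L_S: here cross-entropy; any segmentation loss from
    # Subheading 2.1.3 could be substituted.
    loss_sup = F.cross_entropy(model(x_l), y_l)

    # Consistency term L_C: mean squared error between softmax predictions
    # of the unlabeled batch and its perturbed counterpart.
    prob_u = torch.softmax(model(x_u), dim=1)
    prob_u_pert = torch.softmax(model(x_u_perturbed), dim=1)
    loss_cons = F.mse_loss(prob_u, prob_u_pert)

    # alpha balances the unsupervised consistency term against the supervised term.
    return loss_sup + alpha * loss_cons
```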

It is worth noting that the random perturbations involved in consistency regularization can be implemented in different ways. For instance, the Π model [95] encourages consistent network outputs between two versions of the same input data, i.e., under different data augmentations and different network dropout conditions. In this way, training can leverage the labeled data by optimizing the supervised segmentation loss and the unlabeled data by using this unsupervised consistency loss. In mean teacher [96], the authors propose to compute the consistency between the outputs of the student network and the teacher network (whose weights are an exponential moving average of the student network weights) for the same input data. In unsupervised data augmentation (UDA) [97], unlabeled data are augmented via different augmentation strategies such as RandAugment [100] and fed to the same network to obtain two model predictions, which are used to compute the consistency loss. Similarly, in MixMatch [98], another very popular SSL method, an unlabeled image is augmented K times, and the average of the resulting predictions is sharpened and then used as the supervision signal for the consistency loss. Moreover, in FixMatch [99], the consistency loss is computed between weakly and strongly augmented versions of the same input. In summary, consistency regularization has been widely used in various SSL techniques to leverage unlabeled data.
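
As an illustration of the mean teacher idea described above, the snippet below sketches how a teacher network's weights can be maintained as an exponential moving average (EMA) of the student's weights. The decay value and the helper names are assumptions for the example; in training, the consistency loss would be computed between student and teacher predictions on the same (differently perturbed) input, and the EMA update would be applied after each optimizer step.

```python
import copy
import torch

def make_teacher(student):
    """Create a frozen copy of the student network to serve as the teacher."""
    teacher = copy.deepcopy(student)
    for p in teacher.parameters():
        p.requires_grad_(False)
    return teacher

@torch.no_grad()
def ema_update(teacher, student, decay=0.99):
    """teacher <- decay * teacher + (1 - decay) * student, parameter by parameter."""
    for t_p, s_p in zip(teacher.parameters(), student.parameters()):
        t_p.mul_(decay).add_(s_p, alpha=1.0 - decay)
```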

Application: MTANS

MTANS [101] is an SSL framework for brain lesion segmentation. As shown in Fig. 12, the MTANS framework is built upon the mean teacher model [96], where both the teacher and the student models predict the brain lesion segmentation as well as the signed distance maps of the object surfaces. As a variant of the mean teacher model, MTANS incorporates consistency regularization in its training strategy. Specifically, the authors propose to compute multi-scale feature consistency as the consistency regularization, whereas the traditional mean teacher model only computes consistency at the output level. In addition, a discriminator network is used to extract hierarchical features and differentiate the signed distance maps obtained from labeled and unlabeled data. MTANS is evaluated on three public brain lesion datasets: ISBI 2015 (multiple sclerosis) [102], ISLES 2015 (ischemic stroke) [103], and BraTS 2018 (brain tumor) [104]. Experimental results show that MTANS can outperform the supervised baseline and other competing SSL methods when trained with the same amount of labeled data.

Fig. 12

An illustration of the MTANS framework. The blue solid lines indicate the path of unlabeled data, while the labeled data follows the black lines. The two segmentation models provide the segmentation map and the signed distance map (SDM). The discriminator is applied to check the consistency of the outputs from the teacher and student models. The parameters of the teacher model are updated according to the student model using the exponential moving average (EMA). Ⓒ2021 Elsevier. Reprinted, with permission, from [101]

2.4.4 Entropy Minimization

Entropy minimization is another important SSL technique and is often used together with consistency training. Generally, entropy is a measure of the disorder or uncertainty of a system. In the context of SSL, the term often refers to the uncertainty in the pseudo-labels obtained for the unlabeled data. Entropy minimization, also known as minimum entropy regularization, encourages the model to produce high-confidence predictions. The idea is built upon the low-density assumption, as it requires the network to output low-entropy predictions on unlabeled data. High-confidence pseudo-labels have been found to be very effective when used as supervision for unlabeled data. For example, in MixMatch, the pseudo-label of an unlabeled sample, i.e., the average prediction over K augmented versions, is "sharpened" by adjusting the prediction distribution. This sharpening process is an implicit way to minimize the entropy of the predictions on the unlabeled data. In pseudo-label [94], the authors propose to construct hard (one-hot) pseudo-labels from the high-confidence predictions on the unlabeled data, which is another form of entropy minimization. In addition, the UDA method computes the consistency loss only when the highest predicted class probability is above a pre-defined threshold. Similarly, in FixMatch, the predictions on the weakly augmented unlabeled data are first filtered by a pre-defined threshold and then converted to one-hot pseudo-labels.
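
Two common entropy minimization mechanisms mentioned above, temperature sharpening (as in MixMatch) and confidence thresholding with hard pseudo-labels (as in FixMatch), can be sketched as follows. The temperature and threshold values are illustrative defaults, not values prescribed by the cited works.

```python
import torch

def sharpen(prob, T=0.5):
    """Lower the entropy of a class distribution by raising it to the power 1/T
    and renormalizing (temperature sharpening). prob: (B, C, H, W)."""
    p = prob ** (1.0 / T)
    return p / p.sum(dim=1, keepdim=True)

def hard_pseudo_labels(prob, threshold=0.95):
    """Convert predictions to hard pseudo-labels and keep only the pixels
    whose top-class confidence exceeds the threshold."""
    confidence, label = prob.max(dim=1)        # both (B, H, W)
    mask = (confidence >= threshold).float()   # 1 where the pseudo-label is trusted
    return label, mask
```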

2.4.5 Self-training

Self-training is an iterative training process in which the network uses high-confidence pseudo-labels of the unlabeled data obtained in previous training steps. Interestingly, it has been shown that self-training is equivalent to a version of the classification EM algorithm [105]. The ideas of self-training and consistency regularization are very similar. Here, we differentiate the two concepts as follows: for consistency regularization, the supervision signals for the unlabeled data are generated online, i.e., from the current training epoch; in contrast, for self-training, the pseudo-labels of the unlabeled data are generated offline, i.e., from previous training epochs. Typically, in self-training, the pseudo-labels produced in previous epochs need to be carefully processed before being used as supervision, as this is crucial to the effectiveness of self-training methods. In the SSL literature, pseudo-label [94] is a representative self-training method. In pseudo-label, the network is first trained on the labeled data only. The pseudo-labels of the unlabeled data are then obtained by feeding them to the trained model. Next, the top K predictions on the unlabeled data are used as the pseudo-labels for the next epoch. The training objective function of pseudo-label is as follows:

$$ {L}_{PL}=\sum \limits_{\left({x}_l,{y}_l\right)\in {\mathcal{D}}_L}{L}_S\left({x}_l,{y}_l\right)+\alpha (t)\sum \limits_{{x}_u\in {\mathcal{D}}_U}{L}_S\left({x}_u,{\tilde{y}}_u\right) $$
(32)

where \( {\tilde{y}}_u \) is the pseudo-label and α(t) is a balancing term that weighs the importance of pseudo-label training. In particular, α(t) is designed to increase slowly so as to help the optimization process avoid poor local minima [94]. Note that both labeled and unlabeled data are trained in a supervised manner, with ground truth labels yl and pseudo-labels \( {\tilde{y}}_u \), respectively.
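
A minimal sketch of the pseudo-label objective in Eq. 32 is given below; the piecewise-linear ramp-up of α(t) follows the schedule described in [94] (the default values may need tuning), while the function names and the use of cross-entropy for both terms are assumptions for illustration.

```python
import torch.nn.functional as F

def alpha_schedule(t, t1=100, t2=600, alpha_max=3.0):
    """alpha(t) ramps up from 0 to alpha_max between epochs t1 and t2,
    so that pseudo-labels contribute little early in training."""
    if t < t1:
        return 0.0
    if t < t2:
        return alpha_max * (t - t1) / (t2 - t1)
    return alpha_max

def pseudo_label_loss(model, x_l, y_l, x_u, y_u_pseudo, t):
    """Eq. 32: supervised loss on labeled data plus a weighted supervised
    loss on unlabeled data with pseudo-labels from a previous epoch."""
    loss_l = F.cross_entropy(model(x_l), y_l)
    loss_u = F.cross_entropy(model(x_u), y_u_pseudo)
    return loss_l + alpha_schedule(t) * loss_u
```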

Application: 4S

In this study, the authors propose a sequential semi-supervised segmentation (4S) framework [106] for serial electron microscopy image segmentation. As shown in Fig. 13, 4S relies on the self-training strategy, as it applies pseudo-labeling to all slices in the target continuous images starting from only a small number of consecutive labeled input slices. Specifically, a few labeled samples are used for the first round of training. The trained model is then used to generate pseudo-labels for the next slices. Afterward, the segmentation model is retrained using these pseudo-labels and produces new pseudo-labels for the following slices. The method was evaluated on the ISBI 2012 dataset (neural cell membranes) [107] and a Japanese carpenter ant dataset (nestmate discriminant sensory elements) [108]. Results show that 4S achieved better performance than the purely supervised baseline.

Fig. 13

The workflow of the 4S framework. Based on the assumption that consecutive images are strongly correlated, the manual annotations (true labels) are provided for the first few slices. These labeled data are used for the initial training. Then the model can provide the pseudo-labels for the next few slices which can be applied for retraining. Adapted from [106] (CC BY 4.0)

2.5 Unsupervised Methods

2.5.1 Background

As suggested in Subheadings 2.3 and 2.4, most deep segmentation models learn to map the input image x to the manually annotated ground truth y. Although semi-supervised approaches can drastically reduce the need for labels, low availability of ground truth is still a primary concern for the development of learning-based models. Another disadvantage of supervised learning approaches becomes evident when considering the anomaly detection/segmentation task: a model can only recognize anomalies that are similar to those in the training dataset and will likely fail with rare findings that may not appear in the training data [109].

Unsupervised anomaly detection (UAD) methods have been developed in recent years to tackle these problems. Since no ground truth labels are provided, the models are designed to capture the inherent discrepancy between healthy and pathological data distributions. The general idea is to represent the distribution of normal brain anatomy with a deep model that is trained exclusively on healthy subjects [109]. Consequently, pathological subjects are out of the distribution modeled by the network. Usually, this neural network has an encoder-decoder architecture, such that the output is a reconstruction of the input image. Since the abnormal regions are not well represented in the training data, they cannot be fully reconstructed. Hence, the pixel-wise reconstruction error can be used as an estimate of the anomalous region. Figure 14 illustrates this process.

Fig. 14

The general idea of unsupervised anomaly detection (UAD) realized by an auto-encoder. (a) Train the model with only healthy subjects. (b) Test with pathological samples. The residual image depicts the anomalies. Ⓒ2021 Elsevier. Reprinted, with permission, from [109]

The auto-encoder (AE) and its variations (Fig. 15) are widely used for UAD. All of these models generate a low-dimensional representation of the input image, termed the latent vector z, at the bottleneck. Most research concentrates on manipulating the distribution of z so that the abnormal region can be "cured" in the reconstruction. This process is often referred to as image restoration (or sometimes image inpainting) in the computer vision literature. The following sections discuss some mainstream approaches, categorized by the model structure they implement.

Fig. 15

Variations of auto-encoder. (a) The auto-encoder. (b) The variational auto-encoder. (c) The adversarial auto-encoder includes a discriminator that provides constraint on the distribution of the latent vector z. (d) Anomaly detection VAEGAN introduces a discriminator to check whether the reconstructed image lies in the same distribution as the healthy image. Ⓒ2021 Elsevier. Reprinted, with permission, from [109]

2.5.2 Auto-encoders

The auto-encoder (AE) (Fig. 15a) is the simplest encoder-decoder structure. Let fθ be an encoder and gϕ a decoder, where θ and ϕ are model parameters. Given a healthy input image \( {\boldsymbol{X}}^h\in {\mathbb{R}}^{D\times H\times W} \), the encoder learns to project it to a lower-dimensional latent space z = fθ(Xh), \( \boldsymbol{z}\in {\mathbb{R}}^L \). The decoder then recovers the original image from the latent vector as \( {\hat{\boldsymbol{X}}}^h={g}_{\phi}\left(\boldsymbol{z}\right) \). The model is trained by minimizing a loss function \( \mathcal{L} \) that measures the difference between the input and the reconstructed image:

$$ \underset{\theta, \phi }{\arg \min }\ {\mathcal{L}}_{\theta, \phi}\left({\boldsymbol{X}}^h,{\hat{\boldsymbol{X}}}^h\right)={\left\Vert {\boldsymbol{X}}^h-{\hat{\boldsymbol{X}}}^h\right\Vert}_n $$
(33)

The 1-norm (n = 1) and the 2-norm (n = 2, mean squared error) are common choices for the loss function. The training stage is illustrated in Fig. 14a. When a sample with an anomaly Xa is passed into the model, the abnormal region (e.g., lesion, tumor) cannot be well reconstructed in \( {\hat{\boldsymbol{X}}}^a \), as the model has never seen the anomaly in the healthy training data. In other words, AE-based methods leverage the model's dependence on the training data to discern regions that are out of distribution. Figure 14b shows that the anomaly can be roughly represented by the reconstruction error \( \hat{\boldsymbol{Y}}=\mid {\boldsymbol{X}}^a-{\hat{\boldsymbol{X}}}^a\mid \).
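
The following sketch illustrates the AE-based UAD pipeline: a small convolutional auto-encoder is trained on healthy slices with the reconstruction loss of Eq. 33, and at test time the pixel-wise reconstruction error serves as the anomaly estimate. The architecture is deliberately tiny and 2D; it is an assumption for illustration, not a model reported in the cited works.

```python
import torch
import torch.nn as nn

class TinyAE(nn.Module):
    """Illustrative 2D convolutional auto-encoder (input size divisible by 4)."""
    def __init__(self, channels=1, latent=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(channels, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, latent, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(latent, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, channels, 4, stride=2, padding=1),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

def reconstruction_loss(x, x_hat, n=1):
    """Reconstruction term of Eq. 33: n=1 gives the 1-norm,
    n=2 the squared 2-norm (mean squared error up to a constant)."""
    return (x - x_hat).abs().pow(n).sum()

def anomaly_map(model, x):
    """Pixel-wise residual |X^a - X_hat^a| used as the anomaly estimate."""
    with torch.no_grad():
        x_hat = model(x)
    return (x - x_hat).abs()
```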

Bayesian Auto-encoder

Pawlowski et al. [110] report a Bayesian convolutional auto-encoder to model the healthy data distribution. They introduce model uncertainty and treat the reconstructed image as a Monte Carlo (MC) estimate. Let FΘ be the auto-encoder model with weights Θ and \( \mathcal{D} \) the training dataset. Then, the MC estimate can be expressed as

$$ {F}_{\Theta}\left(\boldsymbol{X}\right)=\int P\left(\boldsymbol{X}|\Theta \right)P\left(\Theta |\mathcal{D}\right)d\Theta \approx \frac{1}{N}\sum \limits_{i=1}^N{F}_{\Theta_i}\left(\boldsymbol{X}\right) $$
(34)

where \( {\Theta}_i\sim P\left(\Theta |\mathcal{D}\right) \). In practice, the authors apply the MC-dropout to model the weight uncertainty. The segmentation is still obtained by setting a threshold on the reconstruction error, as in the vanilla auto-encoder.
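
One possible way to obtain the MC estimate of Eq. 34 with MC-dropout is sketched below; it assumes the auto-encoder contains dropout layers and simply averages several stochastic forward passes.

```python
import torch

def mc_reconstruction(model, x, n_samples=20):
    """Approximate F_Theta(X) by averaging N stochastic reconstructions
    with dropout kept active at inference time (MC-dropout)."""
    model.train()  # keeps dropout stochastic; assumes no batch-norm layers
    with torch.no_grad():
        samples = torch.stack([model(x) for _ in range(n_samples)])
    model.eval()
    return samples.mean(dim=0)
```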

2.5.3 Variational Auto-encoders

In some applications, instead of relying on the model's lack of generalizability, we want to modify the latent vector z to further guarantee that the reconstructed test image \( {\hat{\boldsymbol{X}}}^a \) looks closer to a healthy subject. As before, the residual between Xa and \( {\hat{\boldsymbol{X}}}^a \) is sufficient to highlight the anomalies in the image. Usually, such manipulation requires probabilistic modeling of the latent manifold. Hence, many applications use the variational auto-encoder (VAE) [111] as the backbone of the model (Fig. 15b).

As previously stated, we want the model to learn the distribution of healthy data P(Xh). In the encoder-decoder structure, we introduce a latent vector z at the bottleneck which follows a given distribution P(z). Usually, P(z) is assumed to follow a normal distribution \( \mathcal{N}\left(0,\boldsymbol{I}\right) \). The encoder and decoder are expressed by the conditional probabilities Qθ(z|Xh) and Pϕ(Xh|z), respectively. Then the target distribution is given by

$$ P\left({\mathbf{X}}^h\right)=\int {P}_{\phi}\left({\mathbf{X}}^h|\mathbf{z}\right)P\left(\mathbf{z}\right)d\mathbf{z}. $$
(35)

In addition to the reconstruction loss (e.g., 1/2 norm), the Kullback-Leibler (KL) divergence DKL[Qθ(z|Xh)∥P(z)], which measures the distance between two distributions, is another objective function to minimize. This term provides a constraint on the latent manifold such that the feature vector z can be stochastically sampled from a normal distribution. By modifying Eq. 35 and then applying Jensen's inequality, we get the evidence lower bound (ELBO) \( \mathcal{L} \) for the log-likelihood of the healthy data:

$$ \mathcal{L}\left(\theta, \phi \right)={\mathbb{E}}_{\mathbf{z}\sim {Q}_{\theta}\left(\mathbf{z}|{\mathbf{X}}^h\right)}\left[\log {P}_{\phi}\left({\mathbf{X}}^h|\mathbf{z}\right)\right]-{D}_{KL}\left[{Q}_{\theta}\left(\mathbf{z}|{\mathbf{X}}^h\right)\parallel P\left(\mathbf{z}\right)\right] $$
(36)

Since the ELBO is a lower bound on \( \log P\left({\mathbf{X}}^h\right) \), maximizing \( \mathcal{L} \) serves as a tractable surrogate for maximizing the log-likelihood, so \( -\mathcal{L} \) is used as the objective function to optimize the parameters θ and ϕ of the VAE model. Following the same idea as in the AE-based methods, the networks Qθ and Pϕ model the normal brain anatomy when the training data contain only healthy subjects. Approaches using the VAE take one more step to guarantee that the abnormal region cannot be recovered in the output, that is, they modify the latent vector za of the anomalous input such that za ∼ Qθ(z|Xh).
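
Before turning to how the latent code is manipulated, the negative ELBO of Eq. 36 with a \( \mathcal{N}\left(0,\boldsymbol{I}\right) \) prior can be written in closed form when the encoder outputs a diagonal Gaussian. The sketch below shows this training loss together with the reparameterization trick, under the common assumption of a Gaussian likelihood (mean squared error reconstruction term).

```python
import torch
import torch.nn.functional as F

def reparameterize(mu, logvar):
    """Sample z ~ Q_theta(z | x) = N(mu, diag(exp(logvar))) differentiably."""
    std = torch.exp(0.5 * logvar)
    return mu + std * torch.randn_like(std)

def negative_elbo(x, x_hat, mu, logvar):
    """-L(theta, phi) from Eq. 36: reconstruction term plus the analytical
    KL divergence between N(mu, sigma^2) and the N(0, I) prior."""
    recon = F.mse_loss(x_hat, x, reduction="sum")
    kl = -0.5 * torch.sum(1.0 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl
```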

Given that healthy brains Xh and subjects with anomaly Xa are differently distributed, it is reasonable to assume that their latent manifolds Qθ(z|Xh) and Qθ(z|Xa) also differ. Suppose za = fθ(Xa); then naturally, za ∼ Qθ(z|Xa). If we can modify za so that za ∼ Qθ(z|Xh), then after passing through the decoder Pϕ(Xh|z), the reconstruction output of the model \( {\hat{\mathbf{X}}}^a \) would lie in P(Xh). That is to say, the modification in the latent manifold "cures" the anomaly. It is then easy to identify the anomaly as the residual between the input and the output. The core part of the process is how to "cure" the latent representation of an abnormal input. Some common strategies are described in the following examples.

Distribution Constraint

A straightforward way to force za ∼ Qθ(z|Xh) is to add a specific loss function at the bottleneck. Chen et al. [112] propose an adversarial auto-encoder (AAE), shown in Fig. 15c. The encoder acts as a generator that produces samples in the latent space, and an additional discriminator is trained to judge whether a sample is drawn from the prior \( \mathcal{N}\left(0,\boldsymbol{I}\right) \). This enforces that all latent representations follow \( \mathcal{N}\left(0,\boldsymbol{I}\right) \), whether the input is healthy or not.

Discrete Encoding

Another solution is proposed by Pinaya et al. [113]. They implement the vector-quantized variational auto-encoder (VQ-VAE) [114] to obtain a discrete representation of the latent tensor \( \boldsymbol{z}\in {\mathbb{R}}^{n_z\times h\times w} \), which can be regarded as an h × w image containing a vector \( {\boldsymbol{v}}_i\in {\mathbb{R}}^{n_z} \) at each location, where i = 1, 2, ..., h × w. The quantization of z is realized by a pretrained embedding space (\( {\boldsymbol{e}}_j\in {\mathbb{R}}^{n_z} \), where j = 1, 2, ..., K), which serves as a codebook from which we can always find the code ej closest to a given vi. By simply replacing each vector vi with the index of its closest counterpart in the codebook, a quantized latent image \( {\boldsymbol{z}}_q\in {\mathbb{R}}^{h\times w} \) is obtained. Theoretically, the abnormal region is "cured" by using ej to approximate vi, since the embedding space follows a fixed distribution. As usual, the residual between the input and the reconstructed image, \( \mid \boldsymbol{X}-\hat{\boldsymbol{X}}\mid \), is used to find the anomaly.
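
The nearest-codebook lookup at the heart of the VQ-VAE encoding described above can be sketched as follows; tensor shapes follow the notation in the text, and the function name is an illustrative assumption.

```python
import torch

def quantize(z, codebook):
    """Replace each latent vector v_i with (the index of) its nearest codebook
    entry e_j. z: (n_z, h, w); codebook: (K, n_z). Returns the index image
    with values in {0, ..., K-1} and the quantized latent tensor."""
    n_z, h, w = z.shape
    vectors = z.permute(1, 2, 0).reshape(-1, n_z)   # (h*w, n_z), one v_i per location
    dist = torch.cdist(vectors, codebook)           # pairwise distances to all e_j
    indices = dist.argmin(dim=1)                    # nearest code index per location
    z_quantized = codebook[indices].reshape(h, w, n_z).permute(2, 0, 1)
    return indices.reshape(h, w), z_quantized
```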

Different Normative Prior

Different from the vanilla VAE described above, Dilokthanakul et al. [115] propose a Gaussian mixture VAE (GMVAE) that replaces the unit multivariate Gaussian prior in the latent space with a Gaussian mixture model. GMVAE was used for brain UAD by You et al. [116]. Following the same idea of ruling out the anomaly in the latent space, they restore the anomalous image using maximum a posteriori estimation under the Gaussian mixture model.

2.5.4 Variational Auto-encoders with Generative Adversarial Networks

A generative adversarial network (GAN) consists of two modules, a generator G and a discriminator D. Similar to the decoder in a VAE, the generator G models the mapping from a latent vector to the image space, \( \mathbf{z}\mapsto \mathcal{X} \), where \( \mathbf{z}\sim \mathcal{N}\left(0,\boldsymbol{I}\right) \). The discriminator D can be viewed as a trainable loss function that judges whether a generated image G(z) belongs to the image space \( \mathcal{X} \). Combining a GAN discriminator with a VAE backbone has become a common idea for UAD. More details on GANs can be found in Chap. 5.

We note that D can be used as an additional loss in either the latent or the image space. In the adversarial auto-encoder (AAE) discussed above, the discriminator checks whether the latent vector is drawn from the multivariate normal distribution. In contrast, Baur et al. [117] propose the AnoVAEGAN model (Fig. 15d), in which the discriminator is applied in the image space to check whether the reconstructed image lies in the distribution of healthy data.
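
To illustrate how an image-space discriminator can be attached to a VAE backbone, the sketch below adds an adversarial term to the reconstruction and KL losses. The loss weighting, the use of binary cross-entropy, and all names are assumptions for the example rather than the exact formulation of [117].

```python
import torch
import torch.nn.functional as F

def generator_step_loss(x, x_hat, mu, logvar, discriminator, lambda_adv=0.1):
    """VAE loss plus an adversarial term that pushes reconstructions
    toward the healthy image distribution (image-space discriminator)."""
    recon = F.l1_loss(x_hat, x, reduction="sum")
    kl = -0.5 * torch.sum(1.0 + logvar - mu.pow(2) - logvar.exp())
    d_fake = discriminator(x_hat)                 # discriminator logits for reconstructions
    adv = F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))
    return recon + kl + lambda_adv * adv

def discriminator_step_loss(x_real, x_hat, discriminator):
    """The discriminator learns to separate real healthy images from reconstructions."""
    d_real = discriminator(x_real)
    d_fake = discriminator(x_hat.detach())
    loss_real = F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real))
    loss_fake = F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake))
    return loss_real + loss_fake
```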

3 Medical Image Segmentation Challenges

Medical image segmentation is affected by different aspects of the specific task, such as image quality, visibility of tissue boundaries, and the variability of the target structures. Moreover, each organ, anatomical structure, or lesion type has its own specificities, and a given method may perform well for a given target and worse for another. Therefore, many public challenges are held that target specific problems in an attempt to create benchmarks and attract new researchers into an application field.

In this section, we briefly introduce some of the popular medical image segmentation challenges related to neuroimages. Then, we focus on brain tumor and multiple sclerosis (MS) segmentation challenges and summarize the most competitive methods for each challenge to highlight examples of the concepts discussed in this chapter.

3.1 Popular Segmentation Challenges

Medical image segmentation challenges aim to find better solutions to specific tasks, and they also provide researchers with benchmarks or baseline methods for future development. Furthermore, these developments are driven by clinical needs.

Medical Segmentation Decathlon

There are ten different segmentation tasks in the medical segmentation decathlon (MSD), each focusing on a certain organ or structure [118]: liver tumors, brain tumors, hippocampus, lung tumors, prostate, cardiac, pancreas tumors, colon cancer, hepatic vessels, and spleen. Each task usually involves a different modality. For example, multimodal multisite MRI data are used for brain tumors, while liver tumors are studied from portal venous-phase CT data. The Dice score (DSC) and normalized surface distance are used as evaluation metrics due to their well-known behavior. Instead of seeking state-of-the-art performance on each individual task, MSD aims to find methods that generalize across tasks.

crossMoDA

In recent years, domain adaptation techniques have become a hot topic in the medical image segmentation field, and a new challenge for unsupervised cross-modality domain adaptation, named cross-modality domain adaptation (crossMoDA) for medical image segmentation [119], has been organized. It is the first large, multi-class benchmark for unsupervised domain adaptation to segment the vestibular schwannoma (VS) and cochleas. In short, crossMoDA consists of a labeled dataset of T1-weighted MRIs and an unlabeled dataset of T2-weighted MRIs (the T1-w and T2-w images are unpaired). The goal is to segment the corresponding regions of interest in the unlabeled T2-weighted MRIs by leveraging the information from the unpaired, labeled T1-weighted MRIs.

3.2 Brain Tumor Segmentation Challenge

The brain tumor segmentation (BraTS) challenge is an annual challenge held since 2012 [104, 120,121,122,123]. The participants are provided with a comprehensive dataset that includes annotated, multisite, and multi-parametric MR images. It is worth noting that the dataset has grown from 30 cases to 2000 cases between 2012 and 2021 [123].

Brain tumor segmentation is a difficult task for a variety of reasons [124], including morphological and location uncertainty of the tumor, class imbalance between foreground and background, low contrast of MR images, and annotation bias. BraTS focuses on the segmentation of the enhancing tumor (ET), tumor core (TC), and whole tumor (WT). The Dice score, 95% Hausdorff distance, sensitivity, and specificity are used as evaluation metrics.

BraTS 2021

There are two tasks in BraTS 2021, one of which is the segmentation of brain tumor subregions (task 1) [123].

Dataset

The BraTS 2021 competition comprises 8000 multi-parametric MR images from 2000 patients. The data split is 1251 cases for training, 219 cases for the validation phase, and 530 cases for the final ranking; ground truth labels are provided to participants only for the training set. The validation phase aims to help the participants examine their algorithms, and the results are shown on a public leaderboard. The dataset contains four MRI modalities per subject (Fig. 16): T1-w, post-contrast T1-w (T1Gd), T2-w, and T2-fluid-attenuated inversion recovery (T2-FLAIR). The images were acquired at different institutions with different protocols and scanners. The pre-processing pipeline includes (1) co-registration to the same anatomical template, (2) resampling to isotropic 1 mm3 resolution, and (3) skull stripping.

Fig. 16

BraTS 2021 dataset. The images and ground truth labels of enhancing tumor, tumor core, and whole tumor are shown in the panels A (T1w with gadolinium injection), B (T2w), and C (T2-FLAIR), respectively. Panel D shows the combined segmentations to generate the final tumor subregion labels. Replicated from [123] (CC BY 4.0)

Winner Method

Luu et al. contributed a novel method [125] that won first place in the final ranking on the unseen test data. Their work is based on nnU-Net, the winner of BraTS 2020. Their contributions include using group normalization instead of batch normalization, employing axial attention modules [126, 127] in the decoder, which is efficient for multidimensional data, and building a deeper network. In the training phase, the networks were trained with 5-fold cross-validation. "Online" data augmentations were applied, including random rotation and scaling, elastic deformation, additive brightness augmentation, and gamma correction. The sum of the cross-entropy and Dice losses was used as the loss function. Finally, before being fed to the network, the volumes were cropped to the nonzero region and normalized by their mean and standard deviation.
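
The sum of cross-entropy and soft Dice losses mentioned above is a standard combination in nnU-Net-style pipelines; a generic sketch is given below. This is not the authors' code, and the smoothing constant and tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def dice_ce_loss(logits, target, eps=1e-5):
    """Cross-entropy plus soft Dice for 3D segmentation.
    logits: (B, C, D, H, W); target: (B, D, H, W) integer labels."""
    ce = F.cross_entropy(logits, target)
    prob = torch.softmax(logits, dim=1)
    one_hot = F.one_hot(target, num_classes=logits.shape[1])
    one_hot = one_hot.permute(0, 4, 1, 2, 3).float()   # -> (B, C, D, H, W)
    dims = (0, 2, 3, 4)                                # sum over batch and voxels
    intersection = torch.sum(prob * one_hot, dims)
    denominator = torch.sum(prob + one_hot, dims)
    soft_dice = (2.0 * intersection + eps) / (denominator + eps)
    return ce + (1.0 - soft_dice.mean())
```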

3.3 Multiple Sclerosis Segmentation Challenge

Multiple sclerosis (MS) lesion segmentation from MR images is challenging for both radiologists and automated algorithms. The difficulties of this task include the large variability of lesion appearance, boundary, shape, and location, as well as variations in image appearance caused by different scanners and acquisition protocols from different institutes [128].

MSSEG-2

Delineation of new MS lesions on T2/FLAIR images is of interest as a biomarker of the effectiveness of anti-inflammatory disease-modifying drugs. Building upon the MSSEG (multiple sclerosis segmentation) challenge, MSSEG-2 (https://portal.fli-iam.irisa.fr/msseg-2/) focuses on new MS lesion detection and segmentation. Here, we focus on the new lesion segmentation task.

Dataset

The MSSEG-2 challenge dataset consists of 100 MS patients with 200 scans. Each subject has two FLAIR scans acquired at different timepoints, with a time gap of 1 to 3 years. The images were acquired with 15 different 1.5T/3T scanners. Forty patients and their labels are used for training, and 120 scans from 60 patients are provided to test the performance.

Winner Method

Zhang et al. proposed a novel method for the segmentation of new MS lesions [56] that performed best on the Dice score evaluation. They adopted the model from [46], which is based on the U-Net and dense connections. The model takes as input the concatenation of MR images from different timepoints and outputs the new MS lesion segmentation for each patient. In addition, a 2.5D method, which stacks slices from three orthogonal views (axial, sagittal, and coronal), is applied to each MR scan. In this way, both local and global information are provided to the model during training. Furthermore, to increase the generalizability of the model from the source domain to the target domain, three types of data augmentation are used: image quality augmentation, image intensity augmentation, and spatial augmentation.

4 Conclusion

Image segmentation is a crucial task in medical image analysis. With the help of deep learning algorithms, one can achieve more precise segmentation of brain structures and lesions. In this chapter, we first introduced the fundamental components (Subheadings 2.1.1–2.1.6) needed to set up a complete deep neural network for a medical image segmentation task. Next, we provided a review of the rich literature on medical image segmentation methods, categorized by supervision settings, in Subheadings 2.2–2.5. For each type of supervision, we explained the main ideas and provided example applications. Finally, we introduced some medical image segmentation challenges (Subheading 3) that have publicly available data, so that readers can start their own projects.