Background & Summary

One of the most commonly used imaging technologies in preclinical research is micro Computed Tomography (µCT)1,2, with over 22,000 entries on PubMed for the keyword “micro-CT” to date. It offers high resolution, fast acquisition, and well-calibrated voxel intensities, giving detailed insights into the volumes and internal structures of small animals3. It has high reproducibility and can be utilized both as a standalone imaging modality and in combination with nuclear imaging such as Positron Emission Tomography (PET) or Single Photon Emission Computed Tomography (SPECT)4.

Longitudinal studies can be performed with µCT, as the radiation dose is low. This enables monitoring of disease and treatment progression in the same animal by performing multiple scans, thus extracting more information per animal. This reduces the number of animals required to conduct studies, in accordance with the animal protection 3 R aims (Refinement, Replacement and Reduction)5.

µCT is often performed on a large scale for preclinical research, but the resulting images require further manual analysis to be useful. The current gold standard is manual delineation of regions of interest, which is both laborious and highly user-dependent6,7. This limits the reproducibility of preclinical studies, and the time needed for manual analysis can easily exceed that of the scanning procedure itself. Hence, there is an unmet need for automatic segmentation (AS) tools to mitigate the challenges of reproducibility and time consumption in preclinical imaging studies. Automatic segmentation models are machine learning models that, once trained, can take a new image and assign a label to each pixel without any human intervention.

Recently, with the introduction of machine learning algorithms, repetitive tasks that require human interaction can be automated by training models on large datasets. The use of machine learning algorithms for AS offers the prospect of improving reproducibility, consistency, and reliability in the analysis, and thus a possible solution to the aforementioned challenges in the analysis of preclinical images.

A widespread disease model in image-based preclinical research is immunosuppressed mice in which xenografted tumor cells from human cancers have been injected under the skin, which then develop into human-like subcutaneous tumors8,9,10,11. These models are a staple for human anti-cancer drug discovery, where the drug uptake in the tumor can be measured as well as the tumor growth rate and tumor metabolism. This model can further be utilized for personalized medicine by xenografting individual human tumor biopsies to assess the sensitivity to different anti-cancer agents in a patient-to-patient approach.

While several approaches to AS on medical images exist, there are no public AS models or datasets for subcutaneous tumors in either µCT scans12 or Magnetic Resonance Imaging (MRI) scans13. Research on AS for other types of tumors has been done for µCT14,15, as well as for MRI16,17,18,19,20,21 and cryogenic imaging22. However, these studies were generally performed on small datasets, which limits their usefulness as general tools, and their models have not been made publicly available.

Classically, the approaches to AS have often been atlas-based, where one or multiple anatomical atlases guide the AS23,24, or filter-based, where a large set of filters is used to extract features for a machine learning algorithm25. However, subcutaneous tumors differ significantly in anatomical placement and morphology between mice, which makes them less suitable for atlas-based algorithms. The texture of the tumor and the surrounding soft tissue is quite similar, which also makes texture-based methods less suitable. Deep learning models excel at learning complex interactions between morphology and texture but require large amounts of high-quality data to be trained successfully26. Our dataset aims to fill this data gap and enable deep learning models to be trained for this segmentation task.

We provide a publicly available whole-body preclinical µCT database of mice with subcutaneous tumors. It consists of 452 whole-body µCT scans of 223 individual mice, retrospectively collected from ten different datasets at our institution spanning the years 2014 to 2020. All scans are annotated by three trained annotators, which gives our dataset the size and diversity needed for developing robust AS algorithms, as well as providing a human baseline for inter-annotator agreement.

The aim is that our database will serve as a resource to train and validate machine learning algorithms for AS, thus facilitating the development of fast, robust, and reproducible analysis tools for subcutaneous tumor models.

Methods

Datasets

Ten µCT datasets from 2014 to 2020 were collected (Table 1 and Figure 1). All animal experiments were approved by the Danish Animal Experiments Inspectorate (permit numbers 2012-15-2934-00064 and 2016-15-0201-00920). The animals were housed in the core animal facilities at the University of Copenhagen, Denmark, where they were exposed to a 12:12-hour light/dark cycle at a temperature of 21 ± 2 °C, with access to water and rodent food ad libitum. The animals were acclimatized for at least one week before being included in the experiments. The included µCT scans have not previously been published; they were only used to anatomically guide the extraction of values from corresponding PET images.

Table 1 Details for each dataset. Std = Standard Deviation.
Fig. 1
figure 1

Example of a µCT scan for each of the 10 datasets, with an axial slice containing the tumor with and without the tumor mask overlayed in red, and a 3D Maximum Intensity Projection of the entire scan shown below the axial slices.

The 10 datasets collectively contain 452 µCT scans of 223 individual mice. The mice were scanned longitudinally at different time intervals on a preclinical µCT/PET scanner (Inveon, Siemens, USA). All scans were performed on athymic nude mice with xenografted human tumor cells, which had been allowed to develop into subcutaneous tumors prior to scanning. In 3 of the 10 datasets, each animal had two tumors, one on each flank; in the remaining 7 datasets, each animal had one tumor on the flank (Table 1 and Figure 1). In Dataset 8, the mice had the tumor inoculated behind the front legs instead of the flank. Mice with external necrosis on tumors or a total tumor burden above 2,000 µL were euthanized due to ethical concerns; a typical humane endpoint was 1,500 µL. In Datasets 8 and 10, the mice were scanned in a small animal bed, while the remaining mice were scanned lying freely on the bed, reflecting different real-world scanning scenarios.

The mice were all aged 6–8 weeks at the time of enrolment into the experiments. During the µCT scanning procedure, the mice were anesthetized with a continuous flow of 1–2% sevoflurane while being placed on a heated bed. All scans were reconstructed using either filtered back projection or the Feldkamp cone beam algorithm in the vendor-supplied software (Inveon Acquisition Workplace, Siemens, USA) with a voxel size of 0.210 × 0.210 × 0.210 mm and no spacing between slices. All scans were acquired at 500 µA, while the voltage and exposure times differed between datasets. All details can be seen in Table 2.

Table 2 Detailed scanning parameters for each dataset.

Data preprocessing and annotation

Each µCT scan was performed with either two or four mice in the scanner at the same time, with each mouse being placed in a small animal bed. The µCT scans were preprocessed by cropping out each mouse to 192 × 192 pixels in the x- and y-axes while keeping the full length along the z-axis, and then clipping the dynamic range to between −400 and 1,000 Hounsfield Units (HU). Cropping out each mouse eases the training of machine learning algorithms and reduces storage requirements, since it removes the majority of the air surrounding the mice in the scanner's field of view.
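As an illustration, the cropping and clipping steps described above could be sketched as follows. The function name and the assumption that the centre of each mouse in the axial plane is already known are ours, not part of the published pipeline:

```python
import numpy as np

def preprocess_mouse(volume, center_yx, crop=192, hu_min=-400.0, hu_max=1000.0):
    """Crop one mouse to crop x crop pixels in the axial plane (full
    z-extent kept) and clip intensities to [hu_min, hu_max] HU.

    `volume` is a (z, y, x) array of Hounsfield Units; `center_yx`
    is the (y, x) centre of the mouse, assumed known (hypothetical
    helper input for this sketch).
    """
    cy, cx = center_yx
    half = crop // 2
    # Shift the crop window so it stays inside the scan bounds.
    y0 = max(0, min(cy - half, volume.shape[1] - crop))
    x0 = max(0, min(cx - half, volume.shape[2] - crop))
    cropped = volume[:, y0:y0 + crop, x0:x0 + crop]
    return np.clip(cropped, hu_min, hu_max)
```

This keeps the tumor region while discarding most of the surrounding air, which is what makes the stored scans substantially smaller than the raw field of view.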

After preprocessing, all tumors were manually labeled by three independent annotators using the Napari Viewer27 in Python 3.8 (Python Software Foundation, Delaware, USA). The tumors were annotated by drawing on every 5th axial slice and then using linear interpolation to form the 3D annotation of the tumor. To speed up the annotation process, annotated voxels touching either air (below −300 HU) or bone (above 500 HU) were automatically removed by thresholding, followed by inspection and correction by the annotator if needed. If any central necrosis was present in the tumor, it was included in the annotation mask, in accordance with the RECIST guidelines28, to ensure clinical relevance and translatability. All annotators were blinded to the dataset number and scan time of the mice during annotation to avoid biasing the delineation of the tumors.
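The automatic removal of annotated voxels falling on air or bone amounts to a simple intensity mask. A minimal sketch, with `clean_annotation` as a hypothetical helper name:

```python
import numpy as np

def clean_annotation(mask, volume, air_hu=-300.0, bone_hu=500.0):
    """Drop annotated voxels on air (< air_hu) or bone (> bone_hu).

    `mask` is a boolean annotation array, `volume` the co-registered
    HU volume of the same shape. Thresholds follow the values stated
    in the text.
    """
    keep = (volume >= air_hu) & (volume <= bone_hu)
    return mask & keep
```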

Annotation metrics & evaluation

We used the following metrics to evaluate the annotations, all of which were calculated over the tumor in 3D (i.e., not slice-wise). The inter-annotator agreement was evaluated by calculating the Sørensen-Dice coefficient between annotators29. In our case, it was used to compare the agreement between the three pairs of annotators (A vs. B, A vs. C, and B vs. C). The Sørensen-Dice coefficient was calculated with the following formula:

$${SD}=\frac{2\left|X\cap Y\right|}{\left|X\right|+\left|Y\right|}$$

where X and Y represent the sets of voxels segmented by two different annotators. The Sørensen-Dice coefficient varies between 0 and 1, where a score of 1 denotes perfect overlap between the segmentations and a score of 0 denotes no overlap.
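A minimal implementation of this coefficient for two binary 3D masks might look like the following; the convention of returning 1.0 when both masks are empty is our assumption:

```python
import numpy as np

def dice(x, y):
    """Sørensen-Dice coefficient between two boolean masks."""
    x = np.asarray(x, bool)
    y = np.asarray(y, bool)
    denom = x.sum() + y.sum()
    if denom == 0:
        return 1.0  # both masks empty: treated as perfect agreement
    return 2.0 * np.logical_and(x, y).sum() / denom
```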

To assess the overall agreement between the three annotators, we used Fleiss’ Kappa30. This similarity coefficient is related to Cohen’s Kappa31 but extends to multiple annotators. In brief, Fleiss’ Kappa is calculated by the following formula:

$$\kappa =\frac{\bar{P}-{\bar{P}}_{e}}{1-{\bar{P}}_{e}}$$

The denominator \(1-{\bar{P}}_{e}\) designates the degree of agreement attainable above chance, while the numerator \(\bar{P}-{\bar{P}}_{e}\) designates the degree of agreement actually achieved above chance. If the annotators are in complete agreement, then \(\kappa =1\); if there is no agreement above what would be expected by chance, then \(\kappa \le 0\); and if the agreement is exactly what is expected by chance, then \(\kappa =0\).
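For completeness, the standard Fleiss' Kappa computation can be sketched from a rating-count table; the function name and input layout are illustrative, not taken from the paper:

```python
import numpy as np

def fleiss_kappa(ratings):
    """Fleiss' Kappa from an (N, k) count table, where entry (i, j)
    is the number of raters assigning subject i to category j.
    For voxel-wise tumor masks, N is the number of voxels and k = 2
    (tumor / background). Assumes the same number of raters per subject.
    """
    ratings = np.asarray(ratings, float)
    n = ratings[0].sum()                           # raters per subject
    N = ratings.shape[0]
    p_j = ratings.sum(axis=0) / (N * n)            # category proportions
    P_i = (np.square(ratings).sum(axis=1) - n) / (n * (n - 1))
    P_bar, P_e = P_i.mean(), np.square(p_j).sum()  # observed vs chance
    return (P_bar - P_e) / (1.0 - P_e)
```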

The agreement of volume estimation across the three annotators was assessed as the difference in estimated volume between pairs of annotators. If two tumors were present in a mouse, their volumes were calculated individually. To compare the agreement of all annotators on volume estimation, we used the Root Mean Squared Error (RMSE) between the volume estimated by each annotator and the mean volume across all three annotators. The RMSE indicates the average deviation of each annotator's volume estimate from the mean over all annotators. The RMSE was used rather than the mean difference over all pairs, as the latter would trivially be zero when the annotators are subtracted in a consistent order and would hence not yield any information.
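This RMSE can be sketched as follows, with `volumes` holding one volume estimate per annotator and per tumor; the names and array layout are assumptions for illustration:

```python
import numpy as np

def volume_rmse(volumes):
    """RMSE of each annotator's volume estimate from the mean volume.

    `volumes` is an (n_annotators, n_tumors) array of volumes
    (e.g. in mL). The mean is taken across annotators per tumor,
    and the RMSE is pooled over all annotators and tumors.
    """
    volumes = np.asarray(volumes, float)
    mean = volumes.mean(axis=0, keepdims=True)
    return float(np.sqrt(np.mean((volumes - mean) ** 2)))
```

A volume estimate itself would follow from the voxel count and the 0.210 mm isotropic voxel size, i.e. `mask.sum() * 0.210**3` mm³.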

The annotation results are presented in Table 3, and a detailed evaluation on a dataset-level can be seen in Fig. 2 and in Table 4.

Table 3 Comparison between pairs of annotators and all annotators across all datasets.
Fig. 2
figure 2

Sørensen-Dice coefficient across annotators on each dataset and difference in volume. Each dataset is color-coded, while the annotator pairs are indicated by the hatching: Annotator A vs. B, A vs. C, and B vs. C are shown. The middle line is the median, box ends are quartiles, whiskers extend to 1.5 times the interquartile range, and dots are outliers beyond the 1.5 interquartile range. (a) depicts the Sørensen-Dice coefficient, (b) the difference in volume estimation, and (c) the root mean squared error between the mean tumor volume estimated from all three annotators and the volume estimated by each annotator. The metrics were calculated over the 3D tumor volumes.

Table 4 Sørensen-Dice coefficient and Fleiss’ Kappa on a dataset level (mean ± std) for annotator A, B and C. The metrics were calculated over the 3D tumor volume.

Data Records

The dataset is available at the University of Copenhagen Electronic Research Data Archive32. The data are organized into folders for each dataset, called Dataset 1 to 10. Each dataset folder contains subfolders for mice that were scanned together (either two or four in the same scan). Each of these folders in turn contains subfolders with the cropped-out µCT scan for each mouse, as well as the annotations from each of the three annotators. The mice are named MXX, where XX is the number of the mouse in the dataset. Mouse numbers occur multiple times if the same mouse was scanned at several time points in a dataset. The scan time points appear in the names as Xh or Xd, where X is the number of hours or days since the first scan of the mouse, respectively. An overview of the folder structure can be seen in Fig. 3. The data are saved in compressed Neuroimaging Informatics Technology Initiative (NIfTI) format33, which is compatible with most platforms for medical images. Detailed descriptions of the datasets can be found in Table 2. The xenograft tumor cell line information was either unavailable or proprietary and is therefore not included in this dataset.
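A small helper for parsing this naming scheme could look like the following. Note that the exact separator between the mouse number and the time point is an assumption for illustration and should be checked against the archive; only the MXX and Xh/Xd conventions are taken from the text:

```python
import re

def parse_scan_name(name):
    """Parse a name like 'M07_24h' or 'M12_3d' (separator assumed)
    into (mouse_number, hours_since_first_scan).

    'Xh' encodes hours and 'Xd' days since the mouse's first scan,
    as described in the Data Records section.
    """
    m = re.match(r"M(\d+)_(\d+)([hd])$", name)
    if m is None:
        raise ValueError(f"unrecognised scan name: {name}")
    mouse, value, unit = int(m.group(1)), int(m.group(2)), m.group(3)
    hours = value if unit == "h" else value * 24
    return mouse, hours
```

The compressed NIfTI files themselves can be read with standard medical-imaging libraries such as nibabel or SimpleITK.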

Fig. 3
figure 3

Overview of the folder structure for the datasets. MXX is the mouse number and Xh or Xd is the hours or days since the first scan of the mouse, respectively.

Technical Validation

The presented dataset offers a basis for both development and evaluation of AS algorithms. It further establishes a baseline for human inter-annotator agreement. The overall agreement of the annotators was a Fleiss' Kappa of 0.903, and the Sørensen-Dice coefficient between pairs of annotators was approximately 0.90 (Table 3). The annotator agreement was slightly higher for datasets 5, 7, 8, and 9 compared to the rest of the datasets (Fig. 2 and Table 4), likely because the tumors were larger in these datasets. The degree of agreement was similar to what other studies with manual segmentation of CT images have reported12,34,35,36,37,38,39. For example, in Rosenhain et al.12 the inter-annotator agreement was a Sørensen-Dice coefficient of 0.810 for tumors in contrast-enhanced µCT scans, and at most 0.879 for the organs. As a clinical example, Patil et al.38 obtained a Sørensen-Dice coefficient of 0.89–0.90 for lung tumors on human CT scans. Our finding of approximately 0.90 between annotators is hence reasonable compared to similar datasets.

For the estimation of tumor volume, each annotator pair had a mean disagreement close to zero mL across all datasets, with a standard deviation of about 0.030 mL. The RMSE from the mean volume was 0.015 mL across all datasets. In Datasets 4, 6, 9, and 10, the annotators had slightly lower variance in the volume agreement compared to the rest of the datasets (Fig. 2 and Table 4), most likely because the image quality was slightly higher in these datasets. We note that these results are specific to human xenografts, and other tumor models such as syngeneic models could yield different results.

Usage Notes

All interested researchers are highly encouraged to download the 3D µCT dataset and use it for their own experiments and model development. It can be used to train AS algorithms and evaluate their accuracy against human annotators, or serve as an external evaluation dataset for AS algorithms trained on different data. Since the dataset is annotated by three individual researchers, all annotations can be utilized when training AS algorithms to yield more general and less biased models.

When evaluating AS algorithms on our dataset, we suggest that users test and report their performance against each individual annotator's annotations, as well as the mean performance across all annotators. We have further supplied annotations merged from the three annotators by the STAPLE40 algorithm, which can additionally be used to report the performance of an AI model.
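Where a full STAPLE implementation is not at hand, a per-voxel majority vote is a simple way to merge the three annotations into one consensus mask. This is not the STAPLE algorithm itself, merely a lightweight stand-in for illustration:

```python
import numpy as np

def majority_vote(masks):
    """Merge binary annotation masks by strict per-voxel majority vote.

    `masks` is a sequence of equally shaped boolean arrays, one per
    annotator. With three annotators, a voxel is kept if at least
    two of them marked it as tumor.
    """
    stacked = np.asarray(masks, dtype=np.int32)
    return stacked.sum(axis=0) * 2 > stacked.shape[0]
```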

Having multiple annotations can further be used to develop and evaluate uncertainty quantification algorithms, as the uncertainty for each scan can be calculated from the three different annotations41. The dataset can also be used to train deep learning algorithms on tasks other than subcutaneous tumor segmentation, e.g. annotating new anatomical structures or self-supervised pretraining. The NIfTI format ensures that the scans are compatible with a broad array of commercial and non-commercial software.
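As one example of such an uncertainty measure, the per-voxel binary entropy of the annotator votes yields a simple uncertainty map; the helper below is an illustrative sketch, not a method from the paper:

```python
import numpy as np

def annotation_entropy(masks):
    """Per-voxel binary entropy of annotator agreement.

    `masks` is an (n_annotators, ...) boolean array. The entropy is
    0 where all annotators agree and maximal (1 bit) where votes are
    split evenly.
    """
    p = np.asarray(masks, float).mean(axis=0)  # fraction voting 'tumor'
    with np.errstate(divide="ignore", invalid="ignore"):
        h = -(p * np.log2(p) + (1.0 - p) * np.log2(1.0 - p))
    return np.nan_to_num(h)  # 0 * log(0) terms become 0
```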