Robust identification of perturbed cell types in single-cell RNA-seq data

Nicol, Phillip B.; Paulson, Danielle; Qian, Gege; Liu, X. Shirley; Irizarry, Rafael; Sahu, Avinash D.

doi:10.1038/s41467-024-51649-3

Article
Open access
Published: 01 September 2024

Volume 15, article number 7610, (2024)

Abstract

Single-cell transcriptomics has emerged as a powerful tool for understanding how different cells contribute to disease progression by identifying cell types that change across diseases or conditions. However, detecting changing cell types is challenging due to individual-to-individual and cohort-to-cohort variability and naive approaches based on current computational tools lead to false positive findings. To address this, we propose a computational tool, scDist, based on a mixed-effects model that provides a statistically rigorous and computationally efficient approach for detecting transcriptomic differences. By accurately recapitulating known immune cell relationships and mitigating false positives induced by individual and cohort variation, we demonstrate that scDist outperforms current methods in both simulated and real datasets, even with limited sample sizes. Through the analysis of COVID-19 and immunotherapy datasets, scDist uncovers transcriptomic perturbations in dendritic cells, plasmacytoid dendritic cells, and FCER1G+NK cells, that provide new insights into disease mechanisms and treatment responses. As single-cell datasets continue to expand, our faster and statistically rigorous method offers a robust and versatile tool for a wide range of research and clinical applications, enabling the investigation of cellular perturbations with implications for human health and disease.

Introduction

The advent of single-cell technologies has enabled measuring transcriptomic profiles at single-cell resolution, paving the way for the identification of subsets of cells with transcriptomic profiles that differ across conditions. These cutting-edge technologies empower researchers and clinicians to study human cell types impacted by drug treatments, infections like SARS-CoV-2, or diseases like cancer. To conduct such studies, scientists must compare single-cell RNA-seq (scRNA-seq) data between two or more groups or conditions, such as infected versus non-infected¹, responders versus non-responders to treatment², or treatment versus control in controlled experiments.

Two related but distinct classes of approaches exist for comparing conditions in single-cell data: differential abundance prediction and differential state analysis³. Differential abundance approaches, such as DA-seq, Milo, and Meld^4,5,6,7, focus on identifying cell types with varying proportions between conditions. In contrast, differential state analysis seeks to detect predefined cell types with distinct transcriptomic profiles between conditions. In this study, we focus on the problem of differential state analysis.

Past differential state studies have relied on manual approaches involving visually inspecting data summaries to detect differences in scRNA-seq data. Specifically, cells were clustered based on gene expression data and visualized using uniform manifold approximation and projection (UMAP)⁸. Cell types that appeared separated between the two conditions were identified as different¹. Another common approach is to use the number of differentially expressed genes (DEGs) as a metric for transcriptomic perturbation. However, as noted by ref. ⁹, the number of DEGs depends on the chosen significance level and can be confounded by the number of cells per cell type because this influences the power of the corresponding statistical test. Additionally, this approach does not distinguish between genes with large and small (yet significant) effect sizes.

To overcome these limitations, Augur⁹ uses a machine learning approach to quantify the cell-type specific separation between the two conditions. Specifically, Augur trains a classifier to predict condition labels from the expression data and then uses the area under the receiver operating characteristic curve (AUC) as a metric to rank cell types by their condition difference. However, Augur does not account for individual-to-individual variability (or pseudoreplication¹⁰), which we show can confound the rankings of perturbed cell types.

In this study, we develop a statistical approach that quantifies transcriptomic shifts by estimating the distance (in gene expression space) between the condition means. This method, which we call scDist, introduces an interpretable metric for comparing different cell types while accounting for individual-to-individual and technical variability in scRNA-seq data using linear mixed-effects models. Furthermore, because transcriptomic profiles are high-dimensional, we develop an approximation for the between-group differences, based on a low-dimensional embedding, which results in a computationally convenient implementation that is substantially faster than Augur. We demonstrate the benefits using a COVID-19 dataset, showing that scDist can recover biologically relevant between-group differences while also controlling for sample-level variability. Furthermore, we demonstrate the utility of scDist by jointly inferring information from five single-cell immunotherapy cohorts, revealing significant differences in a subpopulation of NK cells between immunotherapy responders and non-responders, which we validated in bulk transcriptomes from 789 patients. These results highlight the importance of accounting for individual-to-individual and technical variability for robust inference from single-cell data.

Results

Not accounting for individual-to-individual variability leads to false positives

We used blood scRNA-seq from six healthy controls¹ (see Table 1), and randomly divided them into two groups of three, generating a negative control dataset in which no cell type should be detected as being different. We then applied Augur to these data. This procedure was repeated 20 times. Augur falsely identified several cell types as perturbed (Fig. 1A). Augur quantifies differences between conditions with an AUC summary statistic, related to the amount of transcriptional separation between the two groups (AUC = 0.5 represents no difference). Across the 20 negative control repeats, 93% of the AUCs (across all cell types) were >0.5, and red blood cells (RBCs) were identified as perturbed in all 20 trials (Fig. 1A). This false positive result was in part due to high across-individual variability in cell types such as RBCs (Fig. 1B).
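The negative control design described above can be sketched in a few lines. This is only an illustration of the random 3-versus-3 split: the individual labels (`HC1` to `HC6`) and the function name `random_split` are hypothetical, and the step that would apply a method such as Augur to each split is omitted.

```python
import random

def random_split(individuals, rng):
    """Randomly divide the individuals into two equally sized pseudo-groups."""
    shuffled = list(individuals)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return sorted(shuffled[:half]), sorted(shuffled[half:])

# Hypothetical labels for the six healthy controls.
controls = ["HC1", "HC2", "HC3", "HC4", "HC5", "HC6"]

rng = random.Random(0)  # fixed seed so the splits are reproducible
splits = [random_split(controls, rng) for _ in range(20)]

for group_a, group_b in splits[:3]:
    print(group_a, "vs", group_b)
```

Because both pseudo-groups contain only healthy controls, any cell type flagged as different in such a split is a false positive by construction.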

Table 1 Datasets used in the figures

**Fig. 1: Evaluating *Augur*’s performance in negative control experiments.**

We confirmed that individual-to-individual variation underlies false positive predictions made by Augur using a simulation. We generated simulated scRNA-seq data with no condition-level difference and varying patient-level variability (Methods). As patient-level variability increased, differences estimated by Augur also increased, converging to the maximum possible AUC of 1 (Fig. 1C): Augur falsely interpreted individual-to-individual variability as differences between conditions.

Augur recommends that unwanted variability should be removed in a pre-processing step using batch correction software. We applied Harmony¹¹ to the same dataset¹, treating each patient as a batch. We then applied Augur to the resulting batch-corrected PC scores and found that several cell types still had AUCs significantly above the null value of 0.5 (Fig. S1a). On simulated data, Augur also mistook individual-to-individual variability for condition differences even after batch correction was applied as a pre-processing step (Fig. S1b).

A model-based distance metric controls for false positives

To account for individual-to-individual variability, we modeled the vector of normalized counts with a linear mixed-effects model. Mixed models have previously been shown to be successful at adjusting for this source of variability¹⁰. Specifically, for a given cell type, let z_ij be a length G vector of normalized counts for cell i and sample j (G is the number of genes). We then model

$${{{\bf{z}}}}_{ij}={{\boldsymbol{\alpha }}}+{x}_{j}{{\boldsymbol{\beta }}}+{{{\boldsymbol{\omega }}}}_{j}+{{{\boldsymbol{\varepsilon }}}}_{ij}$$

(1)

where α is a vector with entries α_g representing the baseline expression for gene g, x_j is a binary indicator that is 0 if individual j is in the reference condition, and 1 if in the alternative condition, β is a vector with entries β_g representing the difference between condition means for gene g, ω_j is a random effect that represents the differences between individuals, and ε_ij is a random vector (of length G) that accounts for other sources of variability. We assume that ${{{\boldsymbol{\omega }}}}_{j}\mathop{ \sim }\limits^{{{\rm{i}}}.{{\rm{i}}}.{{\rm{d}}}}{{\mathcal{N}}}(0,{\tau }^{2}I)$, ${{{\boldsymbol{\varepsilon }}}}_{ij}\mathop{ \sim }\limits^{{{\rm{i}}}.{{\rm{i}}}.{{\rm{d}}}}{{\mathcal{N}}}(0,{\sigma }^{2}I)$, and that the ω_j and ε_ij are independent of each other.
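To make the role of the random effect ω_j concrete, the following minimal sketch simulates a single gene from model (1) with β = 0 and compares the two groups by pooling cells, ignoring which individual each cell came from. The function names, the number of cells per individual, and the parameter values are assumptions chosen for illustration; this is not the simulation used in the paper.

```python
import random
import statistics

def simulate_group_difference(tau, sigma=1.0, n_per_group=3, cells=200, rng=None):
    """Simulate one gene under model (1) with beta = 0 (and alpha = 0) and
    return the naive difference between pooled cell-level group means."""
    rng = rng or random.Random()
    group_means = []
    for _x in (0, 1):  # x_j: 0 = reference condition, 1 = alternative condition
        values = []
        for _ in range(n_per_group):
            omega = rng.gauss(0.0, tau)  # individual-level random effect
            values.extend(omega + rng.gauss(0.0, sigma) for _ in range(cells))
        group_means.append(statistics.fmean(values))
    return group_means[1] - group_means[0]

def mean_abs_difference(tau, repeats=200, seed=0):
    """Average absolute naive group difference over repeated simulations."""
    rng = random.Random(seed)
    return statistics.fmean(
        abs(simulate_group_difference(tau, rng=rng)) for _ in range(repeats)
    )

low = mean_abs_difference(tau=0.0)   # no individual-to-individual variability
high = mean_abs_difference(tau=2.0)  # strong individual-to-individual variability
print(f"tau = 0: {low:.3f}, tau = 2: {high:.3f}")
```

Even though the true condition effect is zero in both settings, the apparent group difference grows with τ when only three individuals per group are available, which is the behavior a mixed model is meant to account for.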

To obtain normalized counts, we recommend defining z_ij to be the vector of Pearson residuals obtained from fitting a Poisson or negative binomial GLM¹²; this normalization procedure is implemented in the scTransform function¹³. However, our proposed approach can be used with other normalization methods for which the model is appropriate.

Note that in model (1), the means for the two conditions are α and α + β, respectively. Therefore, we quantify the difference in expression profile by taking the 2-norm of the vector β:

$$D:=\left| \left| {\boldsymbol{\beta }} \right| \right| _{2}={\left({\boldsymbol{\beta} }^{\top }{\boldsymbol{\beta }} \right)}^{1/2}=\sqrt{{\sum} _{g=1}^{G}{\beta }_{g}^{2}}.$$

(2)

Here, D can be interpreted as the Euclidean distance between condition means (Fig. 2A).
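As a small worked example of equation (2), the sketch below computes D for two made-up effect vectors. The function name `condition_distance` and the numbers are illustrative assumptions, not values from the paper.

```python
import math

def condition_distance(beta):
    """Euclidean (2-norm) distance between condition means, as in equation (2)."""
    return math.sqrt(sum(b * b for b in beta))

# Many genes with small shifts versus two genes with large shifts.
many_small = [0.1] * 100
few_large = [2.0, 2.0] + [0.0] * 98

print(condition_distance(many_small))
print(condition_distance(few_large))
```

In this toy case the vector with only two shifted genes has the larger distance, because D reflects the magnitude of the shifts and not just how many genes change.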

**Fig. 2: Visual representation of the *scDist* method.**

Because we expected the vector of condition differences β to be sparse, we improved computational efficiency by approximating D with a singular value decomposition to find a K × G matrix U, with K much smaller than G, and

$$D \, \approx \,{D}_{K}:=\sqrt{{\sum} _{k=1}^{K}{(U{\boldsymbol{\beta}} )}_{k}^{2}}.$$

With this approximation in place, we fitted model (1) by replacing z_ij with Uz_ij to obtain estimates of (Uβ)_k. A challenge with estimating D_K is that the maximum likelihood estimator can have a significant upward bias when the number of patients is small (as is typically the case). For this reason, we employed a post-hoc Bayesian procedure to shrink ${{(U{\boldsymbol{\beta}} )}_{k}^{2}}$ towards zero and compute a posterior distribution of D_K¹⁴. We also provided a statistical test for the null hypothesis that D_K = 0. We refer to the resulting procedure as scDist (Fig. 2B). Technical details are provided in Methods.

We applied scDist to the negative control dataset based on blood scRNA-seq from six healthy controls, which was used to show the large number of false positives reported by Augur (Fig. 1), and found that the false positive rate was controlled (Fig. 3A, B). We then applied scDist to the data from the simulation study and found that, unlike Augur, the resulting distance estimate does not grow with individual-to-individual variability (Fig. 3C). scDist also accurately estimated distances on fully simulated data (Fig. S2).

**Fig. 3: Application and performance of *scDist*.**

The Euclidean distance D measures perturbation by taking the sum of squared differences across all genes. To show that this measure is biologically meaningful, we applied scDist to obtain estimated distances between pairs of known cell types in the above dataset and then applied hierarchical clustering to these distances. The resulting clustering is consistent with known relationships driven by cell lineages (Fig. 3D). Specifically, Lymphoid cell types T and NK cells clustered together, while B cells were further apart, and Myeloid cell types DC, monocytes, and neutrophils were close to each other.

Though the scDist distance D assigns each gene an equal weight (unweighted), scDist includes an option to assign different weights w_g to each gene (Methods). Weighting could be useful in situations where certain genes are known to contribute more to specific phenotypes. We conducted a simulation to study the impact of using the weighted distance. These simulations show that when a priori information is available, using the correct weighting leads to a slightly better estimation of the distance. However, incorrect weighting leads to significantly worse estimation compared to the unweighted distance (Fig. S3). Therefore, the unweighted distance is recommended unless strong a priori information is available.

Challenges in cell type annotation are expected to affect the interpretation of scDist, as they do for other methods that rely on a priori cell type annotation^3,9. Our simulations (see Methods) reveal scDist’s vulnerability to false negatives when annotations are confounded by condition- or patient-specific factors. However, when clusters are annotated using data where such differences have been removed, scDist’s predictions become more reliable (Fig. S23). Thus, we recommend removing these confounders before annotation. As potential issues could occur when the inter-condition distance exceeds the inter-cell-type distance, scDist provides a diagnostic plot (Fig. S6) to compare these two distances. scDist also incorporates an additional diagnostic feature (Fig. S24) to identify annotation issues, utilizing a cell-type tree to evaluate cell relationships at different hierarchical levels. Inconsistencies in scDist’s output signal potential clustering or annotation errors.

Comparison to counting the number of DEGs

We also compared scDist to the approach of counting the number of differentially expressed genes (nDEG) on pseudobulk samples³. Given that the statistical power to detect DEGs is heavily reliant on sample size, we hypothesized that nDEG could become a misleading measure of perturbation in single-cell data with a large variance in the number of cells per cell type. To demonstrate this, we applied both methods to resampled COVID-19 data¹ where the number of cells per cell type was artificially varied between 100 and 10,000. nDEG was highly confounded by the number of cells (Fig. 4A), whereas the scDist distance remained relatively constant despite the varying number of cells (Fig. 4B). When the number of subsampled cells is small, the ranking of cell types (by perturbation) was preserved by scDist but not by nDEG (Fig. S5a–c). Additionally, scDist was over 60 times faster than nDEG since the latter requires testing all G genes as opposed to K ≪ G PCs (Fig. S4).

**Fig. 4: The number of differentially expressed genes is susceptible to differences in statistical power.**

An additional limitation of nDEG is that it does not account for the magnitude of the differential expression. We illustrated this with a simple simulation that shows the number of DEGs between two cell types can be the same (or less) despite a larger transcriptomic perturbation in gene expression space (Fig. S7a, b). To demonstrate this on real data, we considered a dataset consisting of eight sorted immune cell types (originally from ref. ¹⁵ and combined by ref. ¹⁶) where scDist and nDEG were applied to all pairs of cell types, and the perturbation estimates were visualized using hierarchical clustering. Although both nDEG and scDist performed well when the sample size was balanced across cell types (Fig. S8), nDEG provided inconsistent results when the CD14 Monocytes were downsampled to create a heterogeneous cell type size distribution. Specifically, scDist produced the expected result of clustering the T cells together, whereas nDEG placed the Monocytes in the same cluster as B and T cells (Fig. 4C, D) despite the fact that these belong to different lineages. Thus, by taking into account the magnitude of the differential expression, scDist is able to produce results more in line with known biology.

We also considered varying the number of patients on simulated data with a known ground truth. Again, the nDEG (computed using a mixed model, as recommended by ref. ¹⁰) increases as the number of patients increases, whereas scDist remains relatively stable (Fig. S9a). Moreover, the correlation between the ground truth perturbation and scDist increases as the number of patients increases (Fig. S9b). Augur was also sensitive to the number of samples and had a lower correlation with the ground truth than both nDEG and scDist.

scDist detects cell types that are different in COVID-19 patients compared to controls

We applied scDist to a large COVID-19 dataset¹⁷ consisting of 1.4 million cells of 64 types from 284 PBMC samples from 196 individuals consisting of 171 COVID-19 patients and 25 healthy donors. The large number of samples of this dataset permitted further evaluation of our approach using real data rather than simulations. Specifically, we defined true distances between the two groups by computing the sum of squared log fold changes (across all genes) on the entire dataset and then estimated the distance on random samples of five cases versus five controls. Because Augur does not estimate distances explicitly, we assessed the two methods’ ability to accurately recapitulate the ranking of cell types based on established ground truth distances. We found that scDist recovers the rankings better than Augur (Fig. 5A, S10). When the size of the subsample is increased to 15 patients per condition, the accuracy of scDist to recover the ground truth rank and distance improves further (Fig. S25).

**Fig. 5: Comparison of *scDist* and *Augur* performance based on real data simulation.**

To evaluate scDist’s accuracy further, we defined a new ground truth using the entire COVID-19 dataset, consisting of two groups: four cell types with differences between groups (true positives) and five cell types without differences (true negatives) (Fig. S11, Methods). We generated 1000 random samples with only five individuals per cell type and estimated group differences using both Augur and scDist. Augur failed to accurately separate the two groups: median difference estimates of all true positive cell types, except MKI67+ CD8+T, were lower than median estimates of all true negative cell types (Fig. 5C). In contrast, scDist estimates showed a clear separation between the two groups (Fig. 5D).

Single-cell data can also exhibit dramatic sample-specific variation in the number of cells of specific cell types. This imbalance can arise from differences in collection strategies, biospecimen quality, and technical effects, and can impact the reliability of methods that do not account for sample-to-sample or individual-to-individual variation. We measured the variation in cell numbers within samples by calculating the ratio of the largest sample’s cell count to the total cell counts across all samples (Methods). Augur’s predictions were negatively impacted by this cell number variation (Figs. 5E, S12), indicating its increased susceptibility to false positives when sample-specific cell number variation was present (Fig. 1C). In contrast, scDist’s estimates were robust to sample-specific cell number variation in single-cell data (Fig. 5F).
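The imbalance measure mentioned above, read literally as the largest sample's cell count divided by the total cell count across samples, can be sketched as follows. The exact definition is given in Methods; the function name `largest_sample_fraction` and the example counts here are assumptions for illustration.

```python
def largest_sample_fraction(cell_counts):
    """Fraction of a cell type's cells contributed by its largest sample."""
    total = sum(cell_counts)
    if total == 0:
        raise ValueError("cell type has no cells in any sample")
    return max(cell_counts) / total

balanced = [25, 25, 25, 25]   # every sample contributes equally
dominated = [10, 10, 80]      # one sample contributes most of the cells

print(largest_sample_fraction(balanced))
print(largest_sample_fraction(dominated))
```

Values near 1 indicate that a single sample dominates the cell type, which is the situation in which sample-specific effects are most easily mistaken for condition effects.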

To further demonstrate the advantage of statistical inference in the presence of individual-to-individual variation, we analyzed the smaller COVID-19 dataset¹ with only 13 samples. The original study¹ discovered differences between cases and controls in CD14+ monocytes through extensive manual inspection. scDist identified this same group as the most significantly perturbed cell type. scDist also identified two cell types not considered in the original study, dendritic cells (DCs) and plasmacytoid dendritic cells (pDCs) (p = 0.01 and p = 0.04, Fig. S13a), although pDC did not remain significant after adjusting for multiple testing. We note that DCs induce anti-viral innate and adaptive responses through antigen presentation¹⁸. Our finding was consistent with studies reporting that DCs and pDCs are perturbed by COVID-19 infection^19,20. In contrast, Augur identified RBCs, not CD14+ monocytes, as the most perturbed cell type (Fig. S14). Omitting the patient with the most RBCs dropped the perturbation between infected and control cases estimated by Augur for RBCs markedly (Fig. S14), further suggesting that Augur predictions are clouded by patient-level variability.

scDist enables the identification of genes underlying cell-specific across-condition differences

To identify transcriptomic alterations, scDist assigns an importance score to each gene based on its contribution to the overall perturbation (Methods). We assessed this importance score for CD14+ monocytes in the smaller COVID-19 dataset¹. In this cell type, scDist assigned the highest importance score to the genes S100 calcium-binding protein A8 (S100A8) and S100 calcium-binding protein A9 (S100A9) (p < 10⁻³, Fig. S13b). These genes are canonical markers of inflammation²¹ that are upregulated during cytokine storm. Since patients with severe COVID-19 infections often experience cytokine storms, the result suggests that S100A8/A9 upregulation in CD14+ monocytes could be a marker of the cytokine storm²². These two genes were reported to be upregulated in COVID-19 patients in the study of 284 samples¹⁷.

scDist identifies transcriptomic alterations associated with immunotherapy response

To demonstrate the real-world impact of scDist, we applied it to four published datasets used to understand patient responses to cancer immunotherapy in head and neck, bladder, and skin cancer patients^2,23,24,25. We found that each individual dataset was underpowered to detect differences between responders and non-responders (Fig. S15). To potentially increase power, we combined the data from all cohorts (Fig. 6A). However, we found that analyzing the combined data without accounting for cohort-specific variations led to false positives. For example, responder-non-responder differences estimated by Augur were highly correlated between pre- and post-treatment samples (Fig. 6B), suggesting a confounding effect of cohort-specific variations. Furthermore, Augur predicted that most cell types were altered in both pre-treatment and post-treatment samples (AUC > 0.5 for 41 in pre-treatment and 44 in post-treatment out of a total of 49 cell types), which is potentially due to the confounding effect of cohort-specific variations.

**Fig. 6: Immunotherapy cohorts analysis using *scDist*.**

To account for cohort-specific variations, we ran scDist with an additional explanatory variable in the model (18) to account for cohort effects. With this approach, distance estimates were not correlated significantly between pre- and post-treatment (Fig. 6B). Removal of these variables re-established correlation (Fig. S16). scDist predicted that CD4-T and CD8-T cells were altered pre-treatment (Fig. S17a), and that NK, CD8-T, and B cells were altered post-treatment (Fig. S17b). Analysis of subtypes revealed that FCER1G+NK cells (NK-2) were changed in both pre-treatment and post-treatment samples (Fig. 6C). To validate this finding, we generated a differential NK-2 signature between responders and non-responders (Fig. S18, Methods) and evaluated this signature in bulk RNA-seq immunotherapy cohorts, comprising 789 patient samples (Fig. 6A). We scored each of the 789 patient samples using the NK-2 differential signature (Methods). The NK-2 signature scores were significantly associated with overall and progression-free survival (Fig. 6D) as well as radiology-based response (Fig. 6E). We similarly evaluated the top Augur prediction. The differential signature from plasma cells, the top predicted cell type by Augur, did not show an association with the response or survival outcomes in 789 bulk transcriptomes (Fig. S19, Methods).

scDist is computationally efficient

A key strength of the linear modeling framework used by scDist is that it is efficient on large datasets. For instance, on the COVID-19 dataset with 13 samples¹, scDist completed the analysis in around 50 seconds, while Augur required 5 minutes. To better understand how runtime depends on the number of cells, we applied both methods to subsamples of the dataset that varied in size and observed that scDist was, on average, five-fold faster (Fig. S20). scDist is also capable of scaling to millions of cells. On simulated data, scDist required approximately 10 minutes to fit a dataset with 1,000,000 cells (Fig. S21). We also tested the sensitivity of scDist to the number of PCs used by comparing D_K for various values of K. We observed that the estimated distances stabilize as K increases (Fig. S22), justifying K = 20 as a reasonable choice for most datasets.

Discussion

The identification of cell types influenced by infections, treatments, or biological conditions is crucial for understanding their impact on human health and disease. We present scDist, a statistically rigorous and computationally fast method for detecting cell-type specific differences across multiple groups or conditions. By using a mixed-effects model, scDist estimates the difference between groups while quantifying the statistical uncertainty due to individual-to-individual variation and other sources of variability. We validated scDist through the unbiased recapitulation of known relationships between immune cells and demonstrated its effectiveness in mitigating false positives from patient-level and technical variations in both simulated and real datasets. Notably, scDist facilitates biological discoveries from scRNA cohorts, even when the number of individuals is limited, a common occurrence in human scRNA-seq datasets. We also pointed out how the detection of cell-type specific differences can be obscured by batch effects or other confounders and how the linear model used by our approach permits accounting for these.

Since the same expression data is used for annotation and by scDist, there are potential issues associated with “double dipping.” Our simulation highlighted this issue by showing that condition-specific effects can result in over-clustering and downward bias in the estimated distances (Methods, Fig. S23). Users can avoid these false negatives by using annotation approaches that can control for patient and condition-specific effects. scDist provides two diagnostic tools to help users identify potential issues in their annotation (Figs. S24 and S6), one of which estimates distances at multiple resolutions (Fig. S24). Despite this, significant errors in clustering and annotation could cause unavoidable bias in scDist, and thus designing a cluster-free extension of scDist is an area for future work. Another point of sensitivity for scDist is the choice of the number of principal components used to estimate the distance. Although in practice we observed that the distance estimate is stable as the number of PCs varies between 20 and 50 (Fig. S22), an adaptive approach for selecting K could improve performance and maximize power. Finally, although Pearson residual-based normalized counts^12,13 are the recommended input for scDist, if the available data were normalized by another, sub-optimal, approach, scDist’s performance could be affected. A future version could adapt the model and estimation procedure so that scDist can be directly applied to the counts, and avoid potential problems introduced by normalization.

We believe that scDist will have extensive utility, as the comparison of single-cell experiments between groups is a common task across a range of research and clinical applications. In this study, we have focused on examining discrete phenotypes, such as infected versus non-infected (in COVID-19 studies) and responders vs. non-responders to checkpoint inhibitors. However, the versatility of our framework allows for extension to experiments involving continuous phenotypes or conditions, such as height, survival, and exposure levels, to name a few. As single-cell datasets continue to grow in size and complexity, scDist will enable rigorous and reliable insights into cellular perturbations with implications for human health and disease.

Methods

Normalization

Our method takes as input a normalized count matrix (with corresponding cell type annotations). We recommend using scTransform¹³ to normalize, although the method is compatible with any normalization approach. Let y_ijg be the UMI counts for gene 1 ≤ g ≤ G in cell i from sample j. scTransform fits the following model:

$${y}_{ijg} \sim {{\rm{NB}}}({\mu }_{g},{\alpha }_{g})$$

(3)

$$\log {\mu }_{g}={\beta }_{0g}+{\beta }_{1g}\log {r}_{ij}$$

(4)

where r_ij is the total number of UMI counts for the particular cell. The normalized counts are given by the Pearson residuals of the above model:

$${z}_{ijg}=\frac{{y}_{ijg}-{\hat{\mu }}_{g}}{\sqrt{{\hat{\mu }}_{g}+{\hat{\mu }}_{g}^{2}/{\hat{\alpha }}_{g}}}$$

(5)
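Equation (5) can be evaluated directly once the fitted mean and dispersion are available. The sketch below assumes those fitted values are given (it does not fit the negative binomial model itself), and the function name `nb_pearson_residual` is a hypothetical helper, not part of scTransform.

```python
import math

def nb_pearson_residual(y, mu_hat, alpha_hat):
    """Pearson residual of a count y under a negative binomial model with
    fitted mean mu_hat and dispersion parameter alpha_hat, as in equation (5)."""
    variance = mu_hat + mu_hat ** 2 / alpha_hat  # NB variance: mu + mu^2 / alpha
    return (y - mu_hat) / math.sqrt(variance)

# A count above, equal to, and below the fitted mean of 2.0.
for count in (5, 2, 0):
    print(count, nb_pearson_residual(count, mu_hat=2.0, alpha_hat=4.0))
```

The residual is zero when the observed count equals the fitted mean, and its scale is set by the model variance rather than by the raw count.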

Distance in normalized expression space

In this section, we describe the inferential procedure of scDist for cases without additional covariates. However, the procedure can be generalized to the full model (18) with arbitrary covariates (design matrix) incorporating random and fixed effects, as well as nested-effect mixed models. For a given cell type, we model the G-dimensional vector of normalized counts as

$${\boldsymbol{z}}_{ij}={\boldsymbol{\alpha}}+{x}_{ij}\,{\boldsymbol{\beta}}+{\boldsymbol{\omega }}_{j}+{\boldsymbol{\varepsilon }}_{{ij}}$$

(6)

where ${\boldsymbol{\alpha}},{\boldsymbol{\beta}} \in {{\mathbb{R}}}^{G}$, x_ij is a binary indicator of condition, ${\boldsymbol{\omega }}_{j} \sim {{\mathcal{N}}}(0,{\tau }^{2}{I}_{G})$, and ${\boldsymbol{\varepsilon }}_{ij} \sim {{\mathcal{N}}}(0,{\sigma }^{2}{I}_{G})$. The quantity of interest is the Euclidean distance between condition means α and α + β:

$$D:=\sqrt{{\boldsymbol{\beta }}^{T}{\boldsymbol{\beta}} }=\left| \left| {\boldsymbol{\beta}} \right| \right| _{2}$$

(7)

If $U\in {{\mathbb{R}}}^{G\times G}$ is an orthonormal matrix, we can apply U to equation (6) to obtain the transformed model:

$$U{\boldsymbol{z}}_{ij}=U{\boldsymbol{\alpha}}+{x}_{ij}U{\boldsymbol{\beta}}+U{\boldsymbol{\omega} }_{j}+U{\boldsymbol{\varepsilon} }_{ij}$$

(8)

Since U is orthogonal, Uω_j and Uε_ij still have spherical normal distributions. We also have that

$${(U{\boldsymbol{\beta}} )}^{T}(U{\boldsymbol{\beta}} )={{\boldsymbol{\beta}} }^{T}{\boldsymbol{\beta}}={D}^{2}$$

(9)
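Both the identity in equation (9) and the claim that the transformed random terms remain spherical follow from $U^{T}U=U{U}^{T}={I}_{G}$. Written out step by step, using only the symbols defined above:

$${(U{\boldsymbol{\beta}} )}^{T}(U{\boldsymbol{\beta}} )={{\boldsymbol{\beta}} }^{T}{U}^{T}U{\boldsymbol{\beta}}={{\boldsymbol{\beta}} }^{T}{I}_{G}{\boldsymbol{\beta}}={{\boldsymbol{\beta}} }^{T}{\boldsymbol{\beta}},\qquad {{\rm{Cov}}}(U{\boldsymbol{\omega }}_{j})=U({\tau }^{2}{I}_{G}){U}^{T}={\tau }^{2}{I}_{G},\qquad {{\rm{Cov}}}(U{\boldsymbol{\varepsilon }}_{ij})=U({\sigma }^{2}{I}_{G}){U}^{T}={\sigma }^{2}{I}_{G}.$$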

This means that the distance in the transformed model is the same as in the original model. As mentioned earlier, our goal is to find U such that

$${D}_{K}:=\sqrt{{\sum} _{k=1}^{K}{(U{\boldsymbol{\beta}} )}_{k}^{2}} \, \approx \, D$$

(10)

with K ≪ G.

Let $Z\in {{\mathbb{R}}}^{n\times G}$ be the matrix with rows z_ij (where n is the total number of cells). Intuitively, we want to choose a U such that the projection of z_ij onto the first K rows of U (${u}_{1},\ldots,{u}_{K}\in {{\mathbb{R}}}^{G}$) minimizes the reconstruction error

$$\sum _{i=1}^{n}| | {z}_{i}-(\mu+{v}_{i1}{u}_{1}+\cdots+{v}_{iK}{u}_{K})| {| }_{2}^{2}$$

(11)

where $\mu \in {{\mathbb{R}}}^{G}$ is a shift vector, $({v}_{ik})\in {{\mathbb{R}}}^{n\times K}$ is a matrix of coefficients, and cells are indexed by a single subscript i = 1, …, n for brevity. It can be shown that the PCA of Z yields the (orthonormal) u₁, …, u_K that minimize this reconstruction error²⁶.

Inference

Given an estimator $\widehat{{(U{\boldsymbol{\beta}} )}_{k}}$ of (Uβ)_k, a naive estimator of D_K is given by taking the square root of the sum of squared estimates:

$$\sqrt{{\sum} _{k=1}^{K}{\widehat{{(U{\boldsymbol{\beta}} )}_{k}}}^{2}}.$$

(12)

However, this estimator can have substantial upward bias due to sampling variability. For instance, even if the true distance is 0, $\widehat{{(U\beta )}_{k}}$ is unlikely to be exactly zero, and squaring turns this noise into a strictly positive contribution to the sum.
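The upward bias can be seen in a small simulation. The Python sketch below (an illustration under assumed values, not scDist code) sets every true (Uβ)_k to 0, draws noisy estimates with a common standard error, and applies the naive estimator (12). The values K = 20, se = 0.1, and the number of replicates are assumptions chosen for the example.

```python
import math
import random


def naive_distance(estimates):
    """Square root of the sum of squared estimates, as in equation (12)."""
    return math.sqrt(sum(e * e for e in estimates))


rng = random.Random(1)
K = 20        # number of retained dimensions (assumed)
se = 0.1      # common standard error of each estimate (assumed)
reps = 2000   # number of simulated datasets (assumed)

# The true (U beta)_k are all 0, so the true distance D_K is exactly 0.
sims = [naive_distance([rng.gauss(0.0, se) for _ in range(K)]) for _ in range(reps)]
mean_naive = sum(sims) / reps

print(f"true distance = 0, mean naive estimate = {mean_naive:.3f}")
print(f"for comparison, se * sqrt(K) = {se * math.sqrt(K):.3f}")
```

Every simulated estimate is strictly positive, and their average is on the order of se·√K rather than 0, which motivates shrinking the estimates before summing their squares.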

To account for this, we apply a post-hoc Bayesian procedure to the $\widehat{{(U{\boldsymbol{\beta}} )}_{k}}$ to shrink them towards zero before computing the sum of squares. In particular, we adopt the spike-and-slab model of ref. ¹⁴:

$$\widehat{{(U{\boldsymbol{\beta}} )}_{k}} \sim {{\mathcal{N}}}\left({(U{\boldsymbol{\beta}} )}_{k},{{\rm{Var}}}\left[\widehat{{(U{\boldsymbol{\beta}} )}_{k}}\right]\right)$$

(13)

$${(U{\boldsymbol{\beta}} )}_{k} \sim {\pi }_{0}{\delta }_{0}+\sum _{t=1}^{T}{\pi }_{t}{{\mathcal{N}}}(0,{\tau }_{t})$$

(14)

where ${{\rm{Var}}}[\widehat{{(U{\boldsymbol{\beta}} )}_{k}}]$ is the variance of the estimator $\widehat{{(U{\boldsymbol{\beta}} )}_{k}}$, δ₀ is a point mass at 0, ${\tau }_{t}$ is the variance of the t-th normal component, and π₀, π₁, …, π_T are mixing weights (that is, they are non-negative and sum to 1). Ref. ¹⁴ provides a fast empirical Bayes approach to estimate the mixing weights and obtain posterior samples of (Uβ)_k. Samples from the posterior of D_K are then obtained by applying formula (12) to the posterior samples of (Uβ)_k. We summarize the posterior distribution by reporting the median and other quantiles. An advantage of this specification is that the amount of shrinkage depends on the uncertainty in the initial estimate of (Uβ)_k.
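To make the shrinkage step concrete, the following Python sketch works through a deliberately simplified version of model (13)-(14) with a single normal slab (T = 1) and with the mixing weight π₀ and slab variance fixed by hand. This is an assumption made for illustration only: the procedure of ref. ¹⁴ estimates the mixing weights by empirical Bayes over several slab components, which is not reproduced here. All function names and numerical values are hypothetical.

```python
import math
import random
import statistics


def normal_pdf(x, var):
    """Density of N(0, var) evaluated at x."""
    return math.exp(-0.5 * x * x / var) / math.sqrt(2.0 * math.pi * var)


def posterior_sample(bhat, se, pi0, slab_var, rng):
    """Draw one posterior sample of a coefficient given its estimate bhat.

    Prior: pi0 * (point mass at 0) + (1 - pi0) * N(0, slab_var).
    Likelihood: bhat ~ N(coefficient, se^2).
    """
    var = se * se
    spike = pi0 * normal_pdf(bhat, var)                     # marginal under the spike
    slab = (1.0 - pi0) * normal_pdf(bhat, var + slab_var)   # marginal under the slab
    p_slab = slab / (spike + slab)
    if rng.random() >= p_slab:
        return 0.0                                          # sample falls on the spike
    shrink = slab_var / (slab_var + var)                    # shrinkage factor in (0, 1)
    return rng.gauss(shrink * bhat, math.sqrt(shrink * var))


def posterior_distance_median(bhats, ses, pi0=0.5, slab_var=1.0, draws=2000, seed=0):
    """Posterior median of the distance, applying formula (12) to each joint draw."""
    rng = random.Random(seed)
    distances = []
    for _ in range(draws):
        sample = [posterior_sample(b, s, pi0, slab_var, rng) for b, s in zip(bhats, ses)]
        distances.append(math.sqrt(sum(x * x for x in sample)))
    return statistics.median(distances)


# Small, noisy estimates are shrunk strongly; large, precise ones are barely changed.
print(posterior_distance_median([0.05, -0.03, 0.02], [0.1, 0.1, 0.1]))
print(posterior_distance_median([3.0, -2.0], [0.1, 0.1]))
```

Because the posterior weight on the spike depends on the ratio of each estimate to its standard error, estimates that are small relative to their uncertainty are pulled to zero, while well-determined estimates pass through almost unchanged.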

We use the following procedure to obtain ${\widehat{U\beta }}_{k}$:

1. Use the matrix of PCA loadings as a plug-in estimator for U. Then Uz_ij is the vector of PC scores for cell i in sample j.
2. Estimate (Uβ)_k by using lme4²⁷ to fit the model (6) using the PC scores corresponding to the k-th loading (i.e., each dimension is fit independently).

Note that only the first K rows of U need to be stored.

We are particularly interested in testing the null hypothesis D_K = 0 against the alternative D_K > 0. Because the null hypothesis corresponds to (Uβ)_k = 0 for all 1 ≤ k ≤ K, we can use the sum of individual Wald statistics as our test statistic:

$$W=\sum _{k=1}^{K}{W}_{k}=\sum _{k=1}^{K}{\left(\frac{{\widehat{(U{\boldsymbol{\beta}} )}}_{k}}{\widehat{{{\rm{se}}}}\left[{\widehat{(U{\boldsymbol{\beta}} )}}_{k}\right]}\right)}^{2}$$

(15)

Under the null hypothesis that (Uβ)_k = 0, W_k can be approximated by an ${F}_{{\nu }_{k},1}$ distribution, where ν_k is estimated using Satterthwaite’s approximation in lmerTest. This implies that

$$W \sim \sum _{k=1}^{K}{F}_{{\nu }_{k},1}$$

(16)

under the null. Moreover, the W_k are independent because we have assumed that covariance matrices for the sample and cell-level noise are multiples of the identity. Equation (16) is not a known distribution but quantiles can be approximated using Monte Carlo samples. To make this precise, let W₁, …, W_M be draws from equation (16), where M = 10⁵ and let W^* be the value of equation (15) (i.e., the actual test statistic). Then the empirical p-value²⁸ is computed as

$$\frac{{\sum }_{i=1}^{M}I({W}_{i} \, > \, {W}^{*})+1}{M+1}$$

(17)
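The Monte Carlo step can be sketched in a few lines of Python. The example below is an illustration rather than the scDist implementation. It reads each F distribution in equation (16) as having 1 numerator and ν_k denominator degrees of freedom (the distribution of a squared t-statistic with ν_k degrees of freedom); this reading, the function names, and the default of 10,000 draws (smaller than the M = 10⁵ used above, to keep the example fast) are assumptions.

```python
import random


def draw_f_1_nu(nu, rng):
    """Draw from an F distribution with 1 and nu degrees of freedom."""
    z = rng.gauss(0.0, 1.0)
    chi2_nu = rng.gammavariate(nu / 2.0, 2.0)  # chi-square with nu degrees of freedom
    return (z * z) / (chi2_nu / nu)


def monte_carlo_pvalue(w_star, nus, n_draws=10_000, seed=0):
    """Empirical p-value of equation (17) for an observed statistic w_star.

    nus holds one denominator degrees-of-freedom value per retained dimension.
    """
    rng = random.Random(seed)
    exceed = 0
    for _ in range(n_draws):
        w = sum(draw_f_1_nu(nu, rng) for nu in nus)  # one draw from equation (16)
        if w > w_star:
            exceed += 1
    return (exceed + 1) / (n_draws + 1)


# Hypothetical example: K = 5 dimensions, each with 10 degrees of freedom.
nus = [10.0] * 5
print(monte_carlo_pvalue(12.0, nus))
```

Adding 1 to both numerator and denominator means the reported p-value can never be exactly 0; its smallest possible value is 1/(M + 1).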

Controlling for additional covariates

Because scDist is based on a linear model, it is straightforward to control for additional covariates such as age or sex of a patient in the analysis. In particular, model (6) can be replaced with

$${\boldsymbol{z}}_{ij}={\boldsymbol{\alpha}}+{x}_{j}{\boldsymbol{\beta}}+\mathop{\sum }_{k=1}^{p}{w}_{ijk}{\boldsymbol{\gamma }}_{k}+{\boldsymbol{\omega }}_{j}+{\boldsymbol{\varepsilon }}_{ij}$$

(18)

where ${w}_{ijk}\in {\mathbb{R}}$ is the value of the kth covariate for cell i in sample j and ${\boldsymbol{\gamma }}_{k}\in {{\mathbb{R}}}^{G}$ is the vector of gene-specific effects of the kth covariate.

Choosing the number of principal components

An important choice in scDist is the number of principal components d (denoted K in the sections above). If d is chosen too small, then estimation accuracy may suffer as the first few PCs may not capture enough of the distance. On the other hand, if d is chosen too large then the power may suffer as a majority of the PCs will simply be capturing random noise (and adding degrees of freedom to the Wald statistic). Moreover, it is important that d is chosen a priori, as choosing the d that produces the lowest p values is akin to p-hacking.

If the model is correctly specified then it is reasonable to choose d = J − 1, where J is the number of samples (or patients). To see why, notice that the mean expression in sample 1 ≤ j ≤ J is

$${x}_{\cdot j}{\boldsymbol{\beta}}+{\omega }_{j}\in {{\mathbb{R}}}^{G}$$

(19)

In particular, the J sample means lie on a (J − 1)-dimensional subspace in ${{\mathbb{R}}}^{G}$. Under the assumption that the condition difference and sample-level variability are larger than the error variance σ², we should expect that the first J − 1 PC vectors capture all of the variance due to differences in sample means.

In practice, however, the model cannot be expected to be correctly specified. For this reason, we find that d = 20 is a reasonable choice when the number of samples is small (as is usually the case in scRNA-seq) and d = 50 for datasets with a large number of samples. This is in line with other single-cell methods, where the number of PCs retained is usually between 20 and 50.

Cell type annotation and “double dipping”

scDist takes as input an annotated list of cells. A common approach to annotate cells is to cluster based on gene expression. Since scDist also uses the gene expression data to measure the condition difference there are concerns associated with “double-dipping” or using the data twice. In particular, if the condition difference is very large and all of the data is used to cluster it is possible that the cells in the two conditions would be assigned to different clusters. In this case scDist would be unable to estimate the inter-condition distance, leading to a false negative. In other words, the issue of double dipping could cause scDist to be more conservative. Note that the opposite problem occurs when performing differential expression between two estimated clusters; in this case, the p-values corresponding to genes will be anti-conservative²⁹.

To illustrate, we simulated a normalized count matrix with 4000 cells and 1000 genes in such a way that there are two “true” cell types and a true condition distance of 4 for both cell types (Fig. S23a). To cluster (annotate) the cells, we applied k-means with various choices of k and compared results by taking the median inter-condition distance across all clusters. As the number of clusters increases, the median distance decays towards 0, which demonstrates that scDist can produce false negatives when the data is over-clustered (Fig. S23b). To avoid this issue, one possible approach is to begin by clustering the data for only one condition and then to assign cells in the other condition by finding the nearest centroid in the existing clusters. When applied to the simulated data this approach is able to correctly estimate the condition distance even when the number of clusters k is larger than the true value.
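The assignment step of this two-stage annotation can be sketched as follows. The Python example assumes that the cells of one condition have already been clustered (for instance by k-means, not shown) and only illustrates how cells from the other condition are mapped to the nearest existing centroid. The toy two-gene data and all function names are hypothetical.

```python
def centroid(cells):
    """Coordinate-wise mean of a non-empty list of equal-length vectors."""
    n = len(cells)
    return [sum(column) / n for column in zip(*cells)]


def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign_to_nearest(cells, centroids):
    """Return, for each cell, the index of the closest centroid."""
    return [
        min(range(len(centroids)), key=lambda c: squared_distance(cell, centroids[c]))
        for cell in cells
    ]


# Clusters found using control cells only (toy data with two genes).
control_clusters = [
    [[0.0, 0.0], [0.2, -0.1], [-0.1, 0.1]],   # cluster 0
    [[5.0, 5.0], [5.1, 4.8], [4.9, 5.2]],     # cluster 1
]
centroids = [centroid(cluster) for cluster in control_clusters]

# Case cells are shifted by the condition effect but keep their cell type label.
case_cells = [[1.0, 1.2], [6.1, 5.9], [0.8, 0.9]]
case_labels = assign_to_nearest(case_cells, centroids)
print(case_labels)
```

Because the clusters are defined without reference to the second condition, a condition shift cannot split one cell type into condition-specific clusters, provided the shift is smaller than the separation between cell types.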

On real data, one approach to identify possible over-clustering is to apply scDist at various cluster resolutions. We used the expression data from the small COVID-19 data¹ to construct a tree ${{\mathcal{T}}}$ with leaf nodes corresponding to the cell types in the original annotation provided by the authors (Fig. S24, see Appendix A for a description of how the tree is estimated). At each internal node $v\in {{\mathcal{T}}}$, we applied scDist to the cluster containing all children of v. We can then visualize the estimated distances by plotting the tree (Fig. S24). Situations where the child nodes have a small distance but the parent node has a large distance could be indicative of over-clustering. For example, PB cells are almost exclusively found in cases (1977 cells in cases and 86 cells in controls), suggesting that it is reasonable to consider PB and B cells as a single cell type when applying scDist.

Feature importance

To better understand the genes that drive the observed difference in the CD14+ monocytes, we define a gene importance score. For 1 ≤ k ≤ d and 1 ≤ g ≤ G, the k-th importance score for gene g is ∣U_kg∣β_g. In other words, the importance score is the absolute value of the gene’s k-th PC loading times its expression difference between the two conditions. Note that the gene importance score is 0 if and only if β_g = 0 or U_kg = 0. Since the U_kg are fixed and known, significance can be assigned to the gene importance score using the differential expression method used to estimate β_g.
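A minimal Python sketch of this score is given below, following the formula as written (the absolute loading multiplied by the signed expression difference). The loading matrix, the β values, and the function name are hypothetical and serve only to show the computation.

```python
def importance_scores(U, beta):
    """Return scores[k][g] = |U[k][g]| * beta[g] for every PC k and gene g."""
    return [[abs(u_kg) * b_g for u_kg, b_g in zip(row, beta)] for row in U]


# Toy example with d = 2 principal components and G = 3 genes.
U = [
    [0.5, -0.5, 0.0],
    [0.0, 0.6, 0.8],
]
beta = [2.0, 1.0, 0.0]

scores = importance_scores(U, beta)
for k, row in enumerate(scores, start=1):
    print(f"PC {k}:", row)
```

In this toy example a score is zero exactly when either the loading or the expression difference is zero, matching the remark above.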

Simulated single-cell data

We test the method on data generated from model equation (6). To ensure that the “true” distance is D, we use the R package uniformly³⁰ to draw β from the surface of the sphere of radius D in ${{\mathbb{R}}}^{G}$. The data in Figs. 1C and 3C are obtained by setting β = 0 and σ² = 1 and varying τ² between 0 and 1.
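For readers who want to reproduce the sampling of β outside R, a standard way to draw uniformly from the surface of a sphere is to normalize a vector of independent standard normal draws. The Python sketch below uses this approach; it is an assumption that it matches what the uniformly package does internally, and the function name and parameter values are hypothetical.

```python
import math
import random


def draw_on_sphere(G, D, rng):
    """Draw a point uniformly from the sphere of radius D in R^G.

    Uses the rotational symmetry of the standard normal distribution:
    a normalized Gaussian vector is uniform on the unit sphere.
    """
    while True:
        z = [rng.gauss(0.0, 1.0) for _ in range(G)]
        length = math.sqrt(sum(x * x for x in z))
        if length > 0.0:  # guard against the (practically impossible) zero vector
            return [D * x / length for x in z]


rng = random.Random(0)
beta = draw_on_sphere(G=1000, D=4.0, rng=rng)
print(math.sqrt(sum(b * b for b in beta)))  # equals D up to floating-point error
```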

Weighted distance

By default, scDist uses the Euclidean distance D which treats each gene equally. In cases where a priori information is available about the relevance of each gene, scDist provides the option to estimate a weighted distance D_w, where $w\in {{\mathbb{R}}}^{G}$ has non-negative components and

$${D}_{w}=\sum _{g=1}^{G}{w}_{g}{\beta }_{g}^{2}$$

(20)

The weighted distance can be written in matrix form by letting $W\in {{\mathbb{R}}}^{G\times G}$ be a diagonal matrix with W_gg = w_g, so that

$${D}_{w}={\boldsymbol{\beta }}^{\top }W{\boldsymbol{\beta}}$$

(21)

Thus, the weighted distance can be estimated by instead considering the transformed model where $U\sqrt{W}$ is applied to each z_ij. After this different transformed model is obtained, estimation and inference of D_w proceeds in exactly the same way as the unweighted case.
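The reduction to the unweighted case rests on the identity β^⊤Wβ = ‖√W β‖², which the following Python sketch checks numerically. The vectors and function names are hypothetical; the orthogonal matrix U is omitted because it does not change the sum of squares.

```python
import math


def weighted_distance(beta, w):
    """D_w as defined in equation (20): the weighted sum of squared effects."""
    return sum(w_g * b_g * b_g for w_g, b_g in zip(w, beta))


def rescale_by_sqrt_weights(beta, w):
    """Apply sqrt(W) to beta, i.e. multiply gene g by sqrt(w_g)."""
    return [math.sqrt(w_g) * b_g for w_g, b_g in zip(w, beta)]


beta = [0.5, -1.0, 2.0, 0.0]
w = [1.0, 0.0, 0.25, 3.0]   # non-negative gene weights (toy values)

D_w = weighted_distance(beta, w)
rescaled = rescale_by_sqrt_weights(beta, w)
unweighted_sq = sum(x * x for x in rescaled)  # unweighted sum of squares after rescaling

print(D_w, unweighted_sq)
```

Genes with weight 0 drop out of the distance entirely, which is why incorrect weights can discard genuinely perturbed genes.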

To test the accuracy of the weighted distance estimate, we considered a simulation where each gene had only a 10% chance of having β_g ≠ 0 (in which case ${\beta }_{g} \sim {{\mathcal{N}}}(0,1)$). We then considered three scenarios: w_g = 1 if β_g ≠ 0 and w_g = 0 otherwise (correct weighting), w_g = 1 for all g (unweighted), and w_g = 1 randomly with probability 0.1 (incorrect weights). We then quantified the performance by taking the absolute value of the error between ${\sum }_{g}{\beta }_{g}^{2}$ and the estimated distance. Figure S3 shows that correct weighting slightly outperforms unweighted scDist but random weights are significantly worse. Thus, the unweighted version of scDist should be preferred unless strong a priori information is available.

Robustness to model misspecification

The scDist model assumes that the cell-specific variance σ² and sample-specific variance τ² are shared across genes. The purpose of this assumption is to ensure that the noise in the transformed model follows a spherical normal distribution. Violations of this assumption could lead to miscalibrated standard errors and hypothesis tests but should not affect estimation. To demonstrate this, we considered simulated data where each gene has σ_g ~ Gamma(r, r) and τ_g ~ Gamma(r/2, r). As r varies, the quality of the distance estimates does not change significantly (Fig. S26).

Semi-simulated COVID-19 data

COVID-19 patient data for the analysis was obtained from ref. ¹⁷, containing 1.4 million cells of 64 types from 284 PBMC samples collected from 196 individuals, including 171 COVID-19 patients and 25 healthy donors.

Ground truth

We define the ground truth as the cell-type specific transcriptomic differences between the 171 COVID-19 patients and the 25 healthy controls. Specifically, we used the following approach to define a ground truth distance:

1. For each gene g, we computed the log fold changes L_g between COVID-19 cases and controls, with L_g = E_g(Covid) − E_g(Control), where E_g denotes the log-transformed expression data $\log (1+x)$.
2. The ground truth distance is then defined as $D={\sum }_{g}{L}_{g}^{2}$.
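The two steps above can be sketched in Python as follows. The sketch assumes that E_g is the mean over cells of log(1 + x) within each group, which is one natural reading of the definition but is not stated explicitly; the function names and toy inputs are hypothetical.

```python
import math


def mean_log1p(values):
    """Mean of log(1 + x) over the cells of one group for one gene."""
    return sum(math.log1p(v) for v in values) / len(values)


def ground_truth_distance(case_expr, control_expr):
    """Return (D, log fold changes) for one cell type.

    case_expr[g] and control_expr[g] are lists of expression values of gene g
    across case cells and control cells, respectively.
    """
    lfc = [mean_log1p(c) - mean_log1p(k) for c, k in zip(case_expr, control_expr)]
    return sum(l * l for l in lfc), lfc


# Toy example with two genes: gene 0 differs between groups, gene 1 does not.
case = [[math.e - 1.0, math.e - 1.0], [3.0, 1.0]]
control = [[0.0, 0.0], [1.0, 3.0]]
D, lfc = ground_truth_distance(case, control)
print(D, lfc)
```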

Subsequently, we excluded any cell types not present in more than 10% of the samples from further analysis. For true negative cell types, we identified the top 5 with the smallest fold change and a representation of over 20,000 cells within the entire dataset. When attempting similar filtering based on cell count alone, no cell types demonstrated a sufficiently large true distance. Consequently, we chose the top four cell types with over 5000 cells as our true positives (Fig. S11).

Using the ground truth, we performed two separate simulation analyses:

1: Simulation analyses I (Fig. 5A, B): Using one half of the dataset (712621 cells, 132 case samples, 20 control samples), we created 100 subsamples consisting of 5 cases and 5 controls. For each subsample, we applied both scDist and Augur to estimate perturbation/distance between cases and controls for each cell type. Then we computed the correlation between the ground truth ranking (ordering cells by sum of log fold changes on the whole dataset) and the ranking obtained by both methods. For scDist, we restricted to cell types that had a non-zero distance estimate in each subsample, and for Augur we restricted to cell types that had an AUC greater than 0.5 (Fig. 5A). For Fig. 5B, we took the mean estimated distance across subsamples for which the given cell type had a non-zero distance estimate. This is because in some subsamples a given cell type could be completely absent.
2: Simulation analyses II (Fig. 5C–F): We subsampled the COVID-19 cohort with 284 samples (284 PBMC samples from 196 individuals: 171 with COVID-19 infection and 25 healthy controls) to create 1,000 downsampled cohorts, each containing samples from 10 individuals (5 with COVID-19 and 5 healthy controls). We randomly selected each sample from the downsampled cohort, further downsampled the number of cells for each cell type, and selected them from the original COVID-19 cohort. This downsampling procedure increases both cohort variability and cell-number variations.
Performance evaluation in subsampled cohorts: We applied scDist and Augur to each subsampled cohort, comparing the results for true positive and false positive cell types. We partitioned the sampled cohorts into 10 groups based on cell-number variation, defined as the number of cells in a sample with the highest number of cells for false-negative cell types divided by the average number of cells in cell types. This procedure highlights the vulnerability of computational methods to cell number variation, particularly in negative cell types.

Analysis of immunotherapy cohorts

Data collection

We obtained single-cell data from four cohorts²,²³,²⁴,²⁵, including expression counts and patient response information.

Pre-processing

To ensure uniform processing and annotation across the four scRNA cohorts, we analyzed CD45+ cells (removing CD45− cells) in each cohort and annotated cells using Azimuth³¹ with reference provided for CD45+ cells.

Model to account for cohort and sample variance

To account for cohort-specific and sample-specific batch effects, scDist modeled the normalized gene expression as:

$$Z \sim X+(1| \gamma :\omega )$$

(22)

Here, Z represents the normalized count matrix, X denotes the binary indicator of condition (responder = 1, non-responder = 0); γ and ω are cohort and sample-level random effects, and (1∣γ: ω) models nested effects of samples within cohorts. The inference procedure for distance, its variance, and significance for the model with multiple cohorts is analogous to the single-cohort model.

Signature

We estimated the signature in the NK-2 cell type using differential expression between responders and non-responders. To account for cohort-specific and patient-specific effects in differential expression estimation, we employed a linear mixed model described above for estimating distances, performing inference for each gene separately. The coefficient of X inferred from the linear mixed models was used as the estimate of differential expression:

$$Z \sim X+(1| \gamma :\omega )$$

(23)

Here, Z represents the normalized count matrix, X denotes the binary indicator of condition (responder = 1, non-responder = 0); γ and ω are cohort and sample-level random effects, and (1∣γ: ω) models nested effects of samples within cohorts.

Bulk RNA-seq cohorts

We obtained bulk RNA-seq data from seven cancer cohorts³²,³³,³⁴,³⁵,³⁶,³⁷,³⁸, comprising a total of 789 patients. Within each cohort, we converted counts of each gene to TPM and normalized them to zero mean and unit standard deviation. We collected survival outcomes (both progression-free and overall) and radiologic-based responses (partial/complete responders and non-responders with stable/progressive disease) for each patient.

Evaluation of signature in bulk RNA-seq cohorts

We scored each bulk transcriptome (sample) for the signature using the strategy described in ref. ³⁹. Specifically, the score was defined as the Spearman correlation between the normalized expression and differential expression in the signature. We stratified patients into two groups at the median score. Kaplan–Meier plots were generated using these stratifications, and the significance of survival differences was assessed using the log-rank test. To demonstrate the association of signature levels with radiological response, we plotted signature levels separately for non-responders, partial-responders, and responders.
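The scoring and stratification steps can be illustrated with the following Python sketch. It computes a Spearman correlation (Pearson correlation of average ranks) between a sample's normalized expression and the signature's differential expression over their shared genes, then splits patients at the median score. The gene names, patient identifiers, and the convention that scores equal to the median fall in the "low" group are assumptions made for the example; the survival analysis itself is not shown.

```python
import math
import statistics


def ranks(values):
    """Ranks starting at 1, with ties receiving the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2.0 + 1.0
        for position in range(i, j + 1):
            result[order[position]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def signature_score(sample_expr, signature):
    """Spearman correlation over the genes shared by the sample and the signature."""
    genes = sorted(set(sample_expr) & set(signature))
    x = [sample_expr[g] for g in genes]
    y = [signature[g] for g in genes]
    return pearson(ranks(x), ranks(y))


def stratify_by_median(scores):
    """Label each patient 'high' if strictly above the median score, else 'low'."""
    median = statistics.median(scores.values())
    return {pid: ("high" if s > median else "low") for pid, s in scores.items()}


# Hypothetical signature (differential expression) and two bulk samples.
signature = {"GENE_A": 2.0, "GENE_B": 0.5, "GENE_C": -1.0}
concordant = {"GENE_A": 1.2, "GENE_B": 0.1, "GENE_C": -0.9}
discordant = {"GENE_A": -1.0, "GENE_B": 0.0, "GENE_C": 1.5}
print(signature_score(concordant, signature), signature_score(discordant, signature))

groups = stratify_by_median({"p1": 1.0, "p2": -1.0, "p3": 0.5, "p4": -0.5})
print(groups)
```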

Evaluating Augur Signature in Bulk RNA-Seq Cohorts

A differential signature was derived for Augur’s top prediction, plasma cells, using a procedure analogous to the one described above for scDist. This plasma signature was then assessed in bulk RNA-seq cohorts following the same evaluation strategy as applied to the scDist signature.

Statistics and reproducibility

No statistical method was used to predetermine sample size. No data were excluded from the analyses. The experiments were not randomized. The investigators were not blinded to allocation during experiments and outcome assessment.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

Table 1 gives a list of the datasets used in each figure, as well as details about how the datasets can be obtained. Source data are provided with this paper.

Code availability

scDist is available as an R package and can be downloaded from GitHub⁴⁰: github.com/phillipnicol/scDist. The repository also includes scripts to replicate some of the figures and a demo of scDist using simulated data.

References

1. Wilk, A. J. et al. A single-cell atlas of the peripheral immune response in patients with severe covid-19. Nat. Med. 26, 1070–1076 (2020).
2. Yuen, K. C. et al. High systemic and tumor-associated il-8 correlates with reduced clinical benefit of pd-l1 blockade. Nat. Med. 26, 693–698 (2020).
3. Crowell, H. L. et al. Muscat detects subpopulation-specific state transitions from multi-sample multi-condition single-cell transcriptomics data. Nat. Commun. 11, 6077 (2020).
4. Helmink, B. A. et al. B cells and tertiary lymphoid structures promote immunotherapy response. Nature 577, 549–555 (2020).
5. Zhao, J. et al. Detection of differentially abundant cell subpopulations in scrna-seq data. Proc. Natl. Acad. Sci. 118, e2100293118 (2021).
6. Dann, E., Henderson, N. C., Teichmann, S. A., Morgan, M. D. & Marioni, J. C. Differential abundance testing on single-cell data using k-nearest neighbor graphs. Nat. Biotechnol. 40, 245–253 (2022).
7. Burkhardt, D. B. et al. Quantifying the effect of experimental perturbations at single-cell resolution. Nat. Biotechnol. 39, 619–629 (2021).
8. McInnes, L., Healy, J. & Melville, J. Umap: uniform manifold approximation and projection for dimension reduction. arXiv https://arxiv.org/abs/1802.03426 (2018).
9. Skinnider, M. A. et al. Cell type prioritization in single-cell data. Nat. Biotechnol. 39, 30–34 (2021).
10. Zimmerman, K. D., Espeland, M. A. & Langefeld, C. D. A practical solution to pseudoreplication bias in single-cell studies. Nat. Commun. 12, 1–9 (2021).
11. Korsunsky, I. et al. Fast, sensitive and accurate integration of single-cell data with harmony. Nat. Methods 16, 1289–1296 (2019).
12. Townes, F. W., Hicks, S. C., Aryee, M. J. & Irizarry, R. A. Feature selection and dimension reduction for single-cell rna-seq based on a multinomial model. Genome Biol. 20, 1–16 (2019).
13. Hafemeister, C. & Satija, R. Normalization and variance stabilization of single-cell RNA-seq data using regularized negative binomial regression. Genome Biol. 20, 1–15 (2019).
14. Stephens, M. False discovery rates: a new deal. Biostatistics 18, 275–294 (2017).
15. Zheng, G. X. et al. Massively parallel digital transcriptional profiling of single cells. Nat. Commun. 8, 14049 (2017).
16. Duò, A., Robinson, M. D. & Soneson, C. A systematic performance evaluation of clustering methods for single-cell RNA-seq data. F1000Res. 7, 1141 (2018).
17. Ren, X. et al. Covid-19 immune features revealed by a large-scale single-cell transcriptome atlas. Cell 184, 1895–1913 (2021).
18. Galati, D., Zanotta, S., Capitelli, L. & Bocchino, M. A bird’s eye view on the role of dendritic cells in sars-cov-2 infection: Perspectives for immune-based vaccines. Allergy 77, 100–110 (2022).
19. Pérez-Gómez, A. et al. Dendritic cell deficiencies persist seven months after sars-cov-2 infection. Cell. Mol. Immunol. 18, 2128–2139 (2021).
20. Upadhyay, A. A. et al. Trem2+ and interstitial macrophages orchestrate airway inflammation in sars-cov-2 infection in rhesus macaques. bioRxiv https://www.biorxiv.org/content/10.1101/2021.10.05.463212v1 (2021).
21. Wang, S. et al. S100a8/a9 in inflammation. Front. Immunol. 9, 1298 (2018).
22. Mellett, L. & Khader, S. A. S100a8/a9 in covid-19 pathogenesis: impact on clinical outcomes. Cytokine Growth Factor Rev. 63, 90–97 (2022).
23. Luoma, A. M. et al. Tissue-resident memory and circulating t cells are early responders to pre-surgical cancer immunotherapy. Cell 185, 2918–2935 (2022).
24. Yost, K. E. et al. Clonal replacement of tumor-specific t cells following pd-1 blockade. Nat. Med. 25, 1251–1259 (2019).
25. Sade-Feldman, M. et al. Defining T cell states associated with response to checkpoint immunotherapy in melanoma. Cell 175, 998–1013 (2018).
26. Pearson, K. Liii. on lines and planes of closest fit to systems of points in space. Lond. Edinb. Dublin Philos. Mag. J. Sci. 2, 559–572 (1901).
27. Bates, D., Mächler, M., Bolker, B. & Walker, S. Fitting linear mixed-effects models using lme4. J. Stat. Softw. 67, 1–48 (2015).
28. North, B. V., Curtis, D. & Sham, P. C. A note on the calculation of empirical p values from Monte Carlo procedures. Am. J. Hum. Genet. 71, 439–441 (2002).
29. Neufeld, A., Gao, L. L., Popp, J., Battle, A. & Witten, D. Inference after latent variable estimation for single-cell RNA sequencing data. arXiv https://arxiv.org/abs/2207.00554 (2022).
30. Laurent, S. uniformly: uniform sampling. R package version 0.2.0 https://CRAN.R-project.org/package=uniformly (2022).
31. Hao, Y. et al. Integrated analysis of multimodal single-cell data. Cell 184, 3573–3587 (2021).
32. Mariathasan, S. et al. Tgfβ attenuates tumour response to pd-l1 blockade by contributing to exclusion of T cells. Nature 554, 544–548 (2018).
33. Weber, J. S. et al. Sequential administration of nivolumab and ipilimumab with a planned switch in patients with advanced melanoma (checkmate 064): an open-label, randomised, phase 2 trial. Lancet Oncol. 17, 943–955 (2016).
34. Liu, D. et al. Integrative molecular and clinical modeling of clinical outcomes to pd1 blockade in patients with metastatic melanoma. Nat. Med. 25, 1916–1927 (2019).
35. McDermott, D. F. et al. Clinical activity and molecular correlates of response to atezolizumab alone or in combination with bevacizumab versus sunitinib in renal cell carcinoma. Nat. Med. 24, 749–757 (2018).
36. Riaz, N. et al. Tumor and microenvironment evolution during immunotherapy with nivolumab. Cell 171, 934–949 (2017).
37. Miao, D. et al. Genomic correlates of response to immune checkpoint therapies in clear cell renal cell carcinoma. Science 359, 801–806 (2018).
38. Van Allen, E. M. et al. Genomic correlates of response to ctla-4 blockade in metastatic melanoma. Science 350, 207–211 (2015).
39. Sahu, A. et al. Discovery of targets for immune–metabolic antitumor drugs identifies estrogen-related receptor alpha. Cancer Discov. 13, 672–701 (2023).
40. Nicol, P. scdist https://doi.org/10.5281/zenodo.12709683 (2024).

Acknowledgements

P.B.N. is supported by NIH T32CA009337. A.D.S. received support from R00CA248953, the Michelson Foundation, and was partially supported by the UNM Comprehensive Cancer Center Support Grant NCI P30CA118100. We express our gratitude to Adrienne M. Luoma, Shengbao Suo, and Kai W. Wucherpfennig for providing the scRNA data²³. We also thank Zexian Zeng for assistance with downloading and accessing the bulk RNA-seq dataset.

Author information

Authors and Affiliations

Harvard University, Cambridge, MA, USA
Phillip B. Nicol & Danielle Paulson
University of California San Diego School of Medicine, San Diego, CA, USA
Gege Qian
Dana-Farber Cancer Institute, Boston, MA, USA
X. Shirley Liu & Rafael Irizarry
University of New Mexico Comprehensive Cancer Center, Albuquerque, NM, USA
Avinash D. Sahu

Contributions

P.B.N., D.P., G.Q., X.S.L., R.I., and A.D.S. conceived the study. P.B.N. and A.D.S. implemented the method and performed the experiments. P.B.N., R.I., and A.D.S. wrote the manuscript.

Corresponding authors

Correspondence to Rafael Irizarry or Avinash D. Sahu.

Ethics declarations

Competing interests

X.S.L. conducted the work while being on the faculty at DFCI, and is currently a board member and CEO of GV20 Therapeutics. P.B.N., D.P., G.Q., R.I., and A.D.S. declare no competing interests.

Peer review

Peer review information

Nature Communications thanks the anonymous reviewer(s) for their contribution to the peer review of this work. A peer review file is available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Peer Review File

Reporting Summary

Source data

Source Data

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Nicol, P.B., Paulson, D., Qian, G. et al. Robust identification of perturbed cell types in single-cell RNA-seq data. Nat Commun 15, 7610 (2024). https://doi.org/10.1038/s41467-024-51649-3

Received: 14 December 2023
Accepted: 09 August 2024
Published: 01 September 2024
DOI: https://doi.org/10.1038/s41467-024-51649-3
Springer Nature Limited
