SC3s: efficient scaling of single cell consensus clustering to millions of cells

Quah, Fu Xiang; Hemberg, Martin

doi:10.1186/s12859-022-05085-z

Software
Open access
Published: 12 December 2022

Volume 23, article number 536, (2022)

BMC Bioinformatics

Abstract

Background

Today it is possible to profile the transcriptome of individual cells, and a key step in the analysis of these datasets is unsupervised clustering. For very large datasets, efficient algorithms are required to ensure that analyses can be conducted with reasonable time and memory requirements.

Results

Here, we present a highly efficient k-means based approach, and we demonstrate that it scales favorably with the number of cells with regard to time and memory.

Conclusions

We have demonstrated that our streaming k-means clustering algorithm gives state-of-the-art performance while resource requirements scale favorably for up to 2 million cells.

Background

Technological advances have paved the way for single cell RNAseq (scRNAseq) datasets containing several million cells [1]. Such large datasets require highly efficient algorithms so that analyses can be completed within reasonable time and hardware requirements [2]. A crucial step in single cell workflows is unsupervised clustering, which aims to delineate putative cell types or cell states based on transcriptional similarity [3]. The most popular methods for unsupervised clustering of scRNAseq data are the Louvain and Leiden algorithms. They represent cells as a neighborhood graph where densely connected modules are identified as clusters [4]. However, these methods can be biased by a poorly specified graph, running the risk of identifying structures that are not present in the data [5]. More generally, since it can be shown that no single clustering algorithm will feature all desired statistical properties and perform well for all datasets, the field would benefit from additional methodologies [6].

One of the most widely used unsupervised clustering methods in general is k-means clustering, which forms the basis of several methodologies, including scCCESS [7], SCCAF [8] and the single cell consensus clustering (SC3) algorithm [9]. To achieve robust and accurate results, SC3 uses a consensus approach whereby a large number of parameter combinations are evaluated and subsequently combined. However, both the k-means clustering and the consensus algorithm come at significant computational cost: both run time and memory use scale more than quadratically with the number of cells, prohibiting application to large datasets, which are becoming increasingly commonplace with ever-improving sequencing technologies.

Implementation

Here, we present a new version of this algorithm, single cell consensus clustering with speed (SC3s), where several steps of the original workflow have been optimized to ensure that both run time and memory usage scale linearly with the number of cells (Fig. 1; Additional file 1: Fig. S1). This is achieved by using a streaming approach for the k-means clustering [10], as implemented in the scikit-learn package [11], which makes it possible to process only a small subset of cells in each iteration. Each subset can be processed efficiently in constant time and memory. In addition, as an intermediate step that was not part of the original method, a large number of microclusters are calculated. The microclusters can be reused for different choices of k, which allows substantial savings when analyzing multiple values of k, something that is very common in practice during data exploration. We have also improved the consensus step by replacing the original co-association based method with a one-hot encoding approach [12], on which the k-means clustering algorithm can be run more efficiently (Additional file 1: Fig. S2).
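The two ideas in this paragraph can be illustrated with a minimal sketch. It is not the SC3s source code: the function names `microcluster_then_kmeans` and `one_hot_consensus`, the parameters `n_micro` and `batch_size`, and the choice to cluster the microcluster centroids with ordinary k-means are assumptions made for illustration. Only the scikit-learn classes `MiniBatchKMeans` and `KMeans` are real library calls.

```python
import numpy as np
from sklearn.cluster import KMeans, MiniBatchKMeans


def microcluster_then_kmeans(X, ks, n_micro=50, batch_size=256, seed=0):
    """Learn microclusters from a stream of batches, then reuse them for each k.

    Assumption: the first batch holds at least `n_micro` cells, because the
    first call to `partial_fit` initialises the centroids from that batch.
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(X.shape[0])
    micro = MiniBatchKMeans(n_clusters=n_micro, n_init=3, random_state=seed)
    # Streaming step: only one batch of cells is processed per iteration.
    for start in range(0, X.shape[0], batch_size):
        micro.partial_fit(X[order[start:start + batch_size]])
    micro_labels = micro.predict(X)  # microcluster index for every cell

    results = {}
    for k in ks:
        # Cluster the few microcluster centroids instead of all the cells,
        # so that trying another k does not require a new pass over the data.
        km = KMeans(n_clusters=k, n_init=10, random_state=seed)
        km.fit(micro.cluster_centers_)
        results[k] = km.labels_[micro_labels]
    return results


def one_hot_consensus(label_runs, k, seed=0):
    """Combine several clusterings by running k-means on their one-hot encodings."""
    blocks = []
    for labels in label_runs:
        labels = np.asarray(labels)
        classes = np.unique(labels)
        # One binary column per cluster of this run, one row per cell.
        blocks.append((labels[:, None] == classes[None, :]).astype(np.float32))
    binary = np.hstack(blocks)  # shape: (n_cells, total clusters over all runs)
    return KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(binary)
```

The binary matrix has one row per cell and a number of columns equal to the total number of clusters across runs, so its size grows linearly with the number of cells. A co-association matrix would instead need one entry per pair of cells.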

Results

To evaluate the accuracy of SC3s we used eight datasets with fewer than 10,000 cells where the cell labels are known or have been defined using orthogonal methods, allowing us to compare the results of the transcriptome clustering to a ground truth [9] (Additional file 1: Table S1). These benchmarks show that SC3s has an accuracy which is comparable to the original algorithm (Fig. 2), and that the performance is robust across a broad range of user-customisable parameters (Additional file 1: Figs. S3-S5). Finally, SC3s compares favorably against other clustering methodologies, such as Scanpy, Seurat, FastPG and scDHA, in terms of its accuracy, memory usage and runtime (Fig. 2; Additional file 1: Figs. S1, S6).
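Table 1 reports accuracy as ARI, the adjusted Rand index, so it is reasonable to assume that the comparisons to ground truth described here use the same measure. The following sketch computes ARI from two label vectors using only the standard library. The function name `adjusted_rand_index` is illustrative and this is not the code used by the authors.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(truth, pred):
    """Adjusted Rand index between two labelings of the same cells."""
    if len(truth) != len(pred):
        raise ValueError("label vectors must have the same length")
    total_pairs = comb(len(truth), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two cells: the labelings trivially agree

    # Pairs of cells that are grouped together in both labelings.
    together_both = sum(comb(c, 2) for c in Counter(zip(truth, pred)).values())
    # Pairs of cells that are grouped together in each labeling separately.
    together_truth = sum(comb(c, 2) for c in Counter(truth).values())
    together_pred = sum(comb(c, 2) for c in Counter(pred).values())

    expected = together_truth * together_pred / total_pairs
    maximum = (together_truth + together_pred) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both labelings are a single cluster
    return (together_both - expected) / (maximum - expected)
```

A value of 1 means the two labelings agree up to renaming of the clusters, and values near 0 correspond to the agreement expected by chance.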

To examine the performance for large datasets, SC3s was benchmarked on the mouse organogenesis cell atlas dataset, which contains 2,026,641 cells [1]. Processing, filtering and dimensionality reduction were performed as in the original publication, after which the clustering performance of SC3s was assessed. Compared to the other packages, SC3s was able to achieve both a short runtime and a low memory usage, whilst producing consistent clusters. For example, when compared to the Leiden algorithm, the peak memory usage was similar, but SC3s was approximately 18 times faster (20 min vs 6 h), even when evaluating five k values (Table 1). The slightly lower accuracy was expected because the cell labels used for comparison originated from the Louvain algorithm, a method very similar to the Leiden algorithm, making them an imperfect ground truth. Visual inspection of the assigned labels also revealed that SC3s was able to capture the major structures identified by the authors (Additional file 1: Fig. S7).
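Runtime and peak memory comparisons of this kind can be approximated with a small harness such as the one below. It is a sketch under stated assumptions, not the benchmarking setup used in the paper: the helper name `benchmark` is hypothetical, and `tracemalloc` only tracks memory allocated through Python's allocator, so it can under-report memory used by native libraries.

```python
import time
import tracemalloc


def benchmark(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed seconds, peak traced bytes)."""
    tracemalloc.start()
    started = time.perf_counter()
    try:
        result = fn(*args, **kwargs)
        elapsed = time.perf_counter() - started
        _, peak_bytes = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, elapsed, peak_bytes


def speedup(slow_minutes, fast_minutes):
    """Ratio of two runtimes expressed in the same unit."""
    return slow_minutes / fast_minutes
```

With the runtimes quoted above, 6 h is 360 min, and `speedup(360, 20)` gives the factor of 18.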

Table 1 Runtime, memory and ARI performance benchmarked on the 2 million mouse organogenesis cell atlas dataset

Conclusions

Overall, SC3s is a major improvement over its predecessor, and it represents a scalable and accurate alternative to the widely used neighborhood graph clustering methodologies. Moreover, it is integrated with the popular Scanpy package and utilizes the same underlying data structures [13], making it easy for users to incorporate into existing workflows and to make full use of upstream and downstream functionalities in the ecosystem. Thus, SC3s will allow researchers to analyze scRNAseq datasets as they scale to millions of cells.

Availability and requirements

Project name: SC3s. Project home page: https://github.com/hemberg-lab/sc3s/. Operating system: Platform independent. Programming language: Python. License: BSD-3. Other requirements: None. Restrictions to use by non-academics: None.

Availability of data and materials

All datasets used for benchmarking are publicly available, and they are listed in Additional file 1: Table S1. The Python code for SC3s is licensed under a BSD 3-Clause License. Instructions to install from pip and conda channels are available on GitHub: https://github.com/hemberg-lab/sc3s.

Abbreviations

scRNAseq: Single cell RNAseq
SC3: Single cell consensus clustering
SC3s: Single cell consensus clustering with speed

References

1. Cao J, Spielmann M, Qiu X, Huang X, Ibrahim DM, Hill AJ, et al. The single-cell transcriptional landscape of mammalian organogenesis. Nature. 2019;566:496–502.
2. Melsted P, Booeshaghi AS, Liu L, Gao F, Lu L, Min KHJ, et al. Modular, efficient and constant-memory single-cell RNA-seq preprocessing. Nat Biotechnol. 2021. https://doi.org/10.1038/s41587-021-00870-2.
3. Kiselev VY, Andrews TS, Hemberg M. Challenges in unsupervised clustering of single-cell RNA-seq data. Nat Rev Genet. 2019;20:273–82.
4. Traag VA, Waltman L, van Eck NJ. From Louvain to Leiden: guaranteeing well-connected communities. Sci Rep. 2019;9:5233.
5. Pasta MQ, Zaidi F. Topology of complex networks and performance limitations of community detection algorithms. IEEE Access. 2017;5:10901–14.
6. Sun S, Zhu J, Ma Y, Zhou X. Accuracy, robustness and scalability of dimensionality reduction methods for single-cell RNA-seq analysis. Genome Biol. 2019;20:269.
7. Geddes TA, Kim T, Nan L, Burchfield JG, Yang JYH, Tao D, et al. Autoencoder-based cluster ensembles for single-cell RNA-seq data analysis. BMC Bioinform. 2019;20(Suppl 19):660.
8. Miao Z, Moreno P, Huang N, Papatheodorou I, Brazma A, Teichmann SA. Putative cell type discovery from single-cell gene expression data. Nat Methods. 2020;17:621–8.
9. Kiselev VY, Kirschner K, Schaub MT, Andrews T, Yiu A, Chandra T, et al. SC3: consensus clustering of single-cell RNA-seq data. Nat Methods. 2017;14:483–6.
10. Sculley D. Web-scale k-means clustering. In: Proceedings of the 19th international conference on World wide web. New York: Association for Computing Machinery; 2010. p. 1177–8.
11. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: machine learning in Python. J Mach Learn Res. 2011;12:2825–30.
12. Liu H, Liu T, Wu J, Tao D, Fu Y. Spectral ensemble clustering. In: Proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining. ACM; 2015. p. 715–24.
13. Wolf FA, Angerer P, Theis FJ. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol. 2018;19:15.

Acknowledgements

We would like to thank the Cellular Genetics Informatics team at the Wellcome Trust Sanger Institute for providing compute resources, particularly Simon Murray for helping package SC3s.

Funding

FXQ was supported by a Wellcome Trust PhD studentship. MH was funded by a core grant from the Wellcome Trust. The funder did not play any role in the design of the study and collection, analysis, and interpretation of data and in writing the manuscript.

Author information

Martin Hemberg
Present address: Evergrande Center for Immunologic Diseases, Harvard Medical School and Brigham and Women’s Hospital, 75 Francis Street, Boston, MA, 02115, USA

Authors and Affiliations

Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, CB10 1SA, UK
Fu Xiang Quah & Martin Hemberg
The Gurdon Institute, University of Cambridge, Tennis Court Road, Cambridge, CB2 1QN, UK
Fu Xiang Quah

Authors

Fu Xiang Quah
Martin Hemberg

Contributions

The project was conceived by FXQ and MH. FXQ wrote the code and analyzed the data. MH supervised the research. FXQ and MH wrote the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Martin Hemberg.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

None.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1.

Contains Figs. S1-S7, which provide more details about SC3s performance, and Table S1, which details the datasets used for benchmarking.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Quah, F.X., Hemberg, M. SC3s: efficient scaling of single cell consensus clustering to millions of cells. BMC Bioinformatics 23, 536 (2022). https://doi.org/10.1186/s12859-022-05085-z

Received: 03 August 2022
Accepted: 25 November 2022
Published: 12 December 2022
DOI: https://doi.org/10.1186/s12859-022-05085-z
