Introduction

Variants of uncertain significance (VUSs) are frequently reported in clinical genetic testing (Burke et al. 2022; Fowler and Rehm 2024), reflecting insufficient evidence to classify the variants unambiguously as either disease-causing or benign. Data from experimental studies designed to characterize the impact of DNA variants on protein stability and function can provide strong evidence to support benign or pathogenic classifications under the American College of Medical Genetics and Genomics/Association for Molecular Pathology (ACMG/AMP) guidelines (Richards et al. 2015). Historically, however, such functional evidence has been relatively scarce in the scientific literature because effects on protein function have been experimentally characterized for only a small fraction of genetic variants. Over the past decade, high-throughput sequencing-based cellular assays, collectively termed multiplex assays of variant effect (MAVEs), have been developed to systematically characterize a wide array of molecular functions, including protein-protein interactions (Araya et al. 2012), enzymatic activity (Romero et al. 2015), regulatory potential (Kwasnieski et al. 2012), and protein stability (Matreyek et al. 2018). More recently, variant function has been assessed by comparing single-cell gene expression profiles from cells expressing clinically relevant variants of a gene (Ursu et al. 2022). Unlike previous approaches, MAVEs enable the characterization of many DNA variants within a single pooled experiment.

Although MAVEs present a powerful opportunity to incorporate new, highly informative data into variant classification during clinical germline genetic testing, they have been underutilized in clinical variant classification. Many MAVE experiments have been conducted for basic research rather than for clinical applications; thus, only a few dozen of the many hundreds of MAVE experiments conducted in recent years focused on genes associated with monogenic diseases. Moreover, few of those studies adequately characterized the relevant molecular functions (e.g., signaling, protein stability, cell death) specifically associated with those diseases. Furthermore, incorporation of MAVE data into variant classification frameworks, especially in large clinical laboratory settings, has been hindered by a lack of detailed guidance from professional laboratory groups on how to independently build and evaluate the clinical quality of models derived from MAVE data.

To standardize and more efficiently use MAVE data, we employed a machine-learning-based evidence modeling platform to control the quality of models built from MAVE datasets and to incorporate those models that passed rigorous validation into routine clinical variant classification. MAVE experiments were designed and performed for 44 disease-related genes to generate new functional data for variants of clinical interest observed through genetic testing. Of the resulting 44 models, 19 were selected for inclusion in variant classification. In addition, cellular models were built and evaluated for a further 22 genes using MAVE datasets previously published by the academic community; models for 5 of these 22 genes passed our quality control thresholds. In total, 24 cellular models (19 internal and 5 external) were integrated into clinical variant classification, providing additional evidence to classify over 4,000 variants in over 57,000 individuals.

Materials and methods

Cellular evidence platform for internally-generated MAVEs

For single-cell RNA-seq models developed in our laboratory, variants of interest for a target gene were selected based on whether they could be used to train models or were VUSs discovered during testing by Invitae. VUSs capable of being reclassified with high-quality functional evidence, as well as those observed in multiple individuals, were prioritized during the selection process. The selected variants were synthesized (Twist Biosciences) and designed to include unique barcodes in the 3′ untranslated region (UTR) of the transcript. The barcoded variant pool was then cloned into a donor plasmid backbone using the DNA HiFi Assembly Mix following the manufacturer’s instructions (New England Biolabs). To identify synthesis and cloning errors and ensure high-quality starting libraries, we performed long-read sequencing of the cloned variant libraries using a Sequel IIe system (Pacific Biosciences). To introduce the variants precisely into cells, a site-specific recombination-based approach (Flp-In™ T-REx™, Thermo Scientific) was employed. First, Flp-In™ T-REx™ 293 landing pad cells were co-transfected (FuGENE6, Promega) with the variant library donor plasmid as well as a second plasmid driving expression of the recombinase, according to the manufacturer’s instructions. Forty-eight hours post-transfection, cells were split and selected with hygromycin (150 μg/mL) for 9 days to isolate cells that had undergone successful recombination.

For loss-of-function targets, small interfering RNA pools (custom duplex RNAs, Horizon Biosciences) targeting the endogenous 3′ UTR, which is not present in the exogenous copy, were transfected 72 hours prior to collecting scRNA-seq measurements, following the manufacturer’s instructions (Lipofectamine™ RNAiMAX, Invitrogen). For both loss-of-function and gain-of-function targets, cell pools were exposed to doxycycline (1 μg/mL) for 48 hours to induce expression of the variant library. Following induction, single-cell RNA-seq was performed following the manufacturer’s instructions (Chromium Next GEM Single Cell 3’ Kit, 10X Genomics), targeting 100 cells per variant.

Variants were associated with cells by generating a separate, targeted variant-cell barcode (VCB) library. The VCB library is derived from the single-cell cDNA and contains both the variant barcode and the cell barcode that was introduced during the first-strand synthesis step of the scRNA-seq protocol. VCB libraries were constructed by three consecutive rounds of PCR with intervening bead cleanups (Axygen). The first round began with 5 ng of the single-cell cDNA pool and consisted of 12 cycles of PCR using a target-specific forward primer and a common reverse primer (SI-Primer, 10X Genomics) that anneals to the adapter downstream of the cell barcode. Following cleanup with 0.8x beads, a second round of 8 cycles of PCR was performed with an internal target-specific primer and the common reverse SI-Primer, and the products were purified with a dual-sided cleanup (0.6x and 0.8x beads). A final round of PCR used 1 ng of DNA from the second PCR and included 8 cycles of amplification with forward and reverse primers that incorporated Illumina adapters and indices. The final amplification reaction was subjected to a dual-sided bead cleanup (0.6x and 0.8x beads).

The VCB and RNA-seq libraries were mixed at a 30:70 ratio and sequenced on an Illumina MiSeq instrument (Illumina). Data from the MiSeq run were used both to assess the quality of the RNA-seq libraries prior to deep sequencing and to determine the cell-to-variant associations (see analysis section). Following quality control, the scRNA-seq libraries were repooled, if necessary, for deep sequencing on an Illumina NovaSeq 6000 instrument with a target of 50,000–100,000 reads per cell.

Analysis workflow

Consensus long reads for each variant library were generated using SMRT Link software (Ardui et al. 2018). Reads were aligned to the reference sequence for the target gene, and variant barcodes were identified using the constant sequences flanking the barcode (Farrar 2007). To account for potential sequencing errors, variant barcodes were clustered using umi_tools.network (Smith et al. 2017). The set of reads corresponding to each variant barcode cluster was then examined to identify single nucleotide variants (SNVs) in the coding region of the target gene. For a variant-to-variant-barcode relationship to pass quality control, the most common mutation had to account for at least 50% of the total reads for the variant barcode cluster and match the designed SNV for that barcode. Additionally, we excluded from subsequent analyses variant-to-variant-barcode relationships in which 20% or more of the reads were shorter than the expected length of the target gene by 50 bp or more.
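The two barcode-level QC criteria described above can be sketched as follows. This is a minimal illustration, not the production pipeline: the function name and the (observed variant, read length) data shape are hypothetical.

```python
from collections import Counter

def passes_barcode_qc(reads, designed_snv, expected_len):
    """QC check for one variant barcode cluster.

    `reads` is a list of (observed_variant, read_length) tuples for the
    cluster; `designed_snv` is the SNV this barcode was designed to carry.
    """
    if not reads:
        return False
    variants = [v for v, _ in reads]
    top_variant, top_count = Counter(variants).most_common(1)[0]
    # Criterion 1: the most common mutation must account for at least 50%
    # of the cluster's reads and match the designed SNV for this barcode.
    if top_count / len(reads) < 0.5 or top_variant != designed_snv:
        return False
    # Criterion 2: exclude the relationship if 20% or more of reads are
    # shorter than the expected target length by 50 bp or more.
    short = sum(1 for _, length in reads if length <= expected_len - 50)
    return short / len(reads) < 0.2
```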

To quantify gene expression and identify cell barcodes, CellRanger’s (version 3.1.0) mkfastq, count, and aggr tools were used to transform BCL files into count matrices. To associate variants with cells, variant barcodes (8 bp) were extracted from Read 2 of the VCB library, and 10X cell barcodes (16 bp) and transcript unique molecular identifier (UMI) (12 bp) sequences were extracted from Read 1. The reference sets of variant barcodes and cell barcodes were defined from the variant-to-variant-barcode analysis above and the filtered matrix output of CellRanger count, respectively (Zheng et al. 2017). To account for sequencing errors, 1 bp discrepancies between the reference and extracted barcodes were allowed. For each cell barcode, unique transcript UMIs were extracted and clustered using the UMIClustering adjacency method from umi_tools.network. Valid cell-to-variant-barcode associations fulfilled the following criteria: (1) the number of unique transcript UMIs for a given cell barcode was ≥ 3; (2) a single variant barcode accounted for > 0.5 of the unique transcript UMIs for that cell barcode; and (3) the variant barcode with the second highest fraction of unique transcript UMIs did not exceed 0.25.
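The three cell-to-variant assignment criteria can be expressed compactly as follows. This is an illustrative sketch: the function name and the barcode-to-UMI-count mapping are hypothetical simplifications of the pipeline described above.

```python
def assign_variant(umi_counts, min_umis=3):
    """Assign a variant barcode to one cell, or return None.

    `umi_counts` maps variant barcode -> number of unique transcript UMIs
    observed for this cell barcode.
    """
    total = sum(umi_counts.values())
    if total < min_umis:
        return None  # criterion 1: >= 3 unique transcript UMIs required
    ranked = sorted(umi_counts.items(), key=lambda kv: kv[1], reverse=True)
    top_barcode, top_n = ranked[0]
    if top_n / total <= 0.5:
        return None  # criterion 2: top barcode must exceed 0.5 of UMIs
    if len(ranked) > 1 and ranked[1][1] / total > 0.25:
        return None  # criterion 3: runner-up must not exceed 0.25
    return top_barcode
```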

To remove low-quality cells, we used HDBSCAN (McInnes and Healy 2017) to cluster cells based on the following features: the total number of UMIs per cell, the total number of UMIs in mitochondrial genes per cell, the percentage of mitochondrial UMIs per cell, the total number of genes with nonzero counts per cell, and the total number of UMIs in highly expressed genes per cell. Next, the proportion of cells with < 5000 UMIs was calculated for each cluster. Cells were removed if they were assigned to a cluster in which the proportion of < 5000 UMI cells was < 0.2.

After preprocessing, expression data for cells that passed QC and had an identified variant were used for model training. The labels used for training included pathogenic and benign variants classified by Invitae Corp. as well as a subset of ClinVar submissions (https://www.ncbi.nlm.nih.gov/clinvar/). Genes were removed if they were not expressed in a minimum of 10% of cells for at least one variant. The mean expression of each gene, across the cells carrying each variant, was used as the input for modeling. Row-sum normalization was applied to the cells, features were selected using the index of dispersion as described by Zheng et al. (2017), and PCA was used to reduce dimensions. scikit-learn (Pedregosa et al. 2011) was used to train models (e.g., random forests, logistic regression, and SVMs) and to optimize hyperparameters through 3 repetitions of 5-fold cross-validation. The highest performing model was calibrated using scikit-learn’s CalibratedClassifierCV, and leave-one-out cross-validation (LOOCV) was repeated 10 times to produce the final pathogenicity scores for each variant.
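The modeling steps above can be sketched with scikit-learn as follows. This is a simplified illustration, not the validated production models: the dispersion-based feature selection is omitted, and the function name, hyperparameters, and choice of a random forest are placeholders.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

def train_variant_model(X, y, n_components=10):
    """Row-sum normalization, PCA, and a classifier scored by 3x repeated
    5-fold cross-validation, then calibrated.

    X: variants x genes matrix of mean expression (positive values);
    y: 1 = pathogenic, 0 = benign training labels.
    """
    X = X / X.sum(axis=1, keepdims=True)  # row-sum normalization
    model = make_pipeline(
        PCA(n_components=n_components),
        RandomForestClassifier(n_estimators=100, random_state=0),
    )
    cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=0)
    auroc = cross_val_score(model, X, y, cv=cv, scoring="roc_auc").mean()
    # Calibrate so that model outputs behave like pathogenicity scores.
    calibrated = CalibratedClassifierCV(model, cv=5).fit(X, y)
    return calibrated, auroc
```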

Subject data and clinical variant classification

This study used germline DNA variant data from individuals referred by clinicians for diagnostic genetic testing for hereditary disorders. Clinical variant classification was performed using a validated system, Sherloc, based on the ACMG/AMP guidelines (Nykamp et al. 2017). Sherloc uses a semiquantitative point-based rubric for evaluating variant type, allele frequency, and clinical, functional, and computational evidence. Likely benign and benign classifications require 3 and 5 benign points, respectively, while likely pathogenic and pathogenic classifications require 4 and 5 pathogenic points, respectively. Clinical reports were generated following professional guidelines and included details on variant classifications and their corresponding evidence.
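As an illustration, the point thresholds above map to classifications roughly as follows. This is a deliberate simplification: the actual Sherloc rubric weighs many evidence types jointly, and the function name is hypothetical.

```python
def sherloc_classification(benign_points, pathogenic_points):
    """Map Sherloc point totals to a classification label (simplified)."""
    if pathogenic_points >= 5:
        return "pathogenic"
    if pathogenic_points >= 4:
        return "likely pathogenic"
    if benign_points >= 5:
        return "benign"
    if benign_points >= 3:
        return "likely benign"
    return "uncertain significance"
```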

Results

Framework for evaluating MAVE data

To generate, evaluate, and incorporate various types of machine learning models, we developed a single Evidence Modeling Platform (manuscript in preparation). In the context of MAVE-based models, this platform uses supervised machine learning on experimental features from cellular studies to develop a gene-specific model for predicting variant pathogenicity (Fig. 1). Models that achieved high performance in discriminating between known pathogenic and benign variants (AUROC ≥ 0.8) were deemed valid. The outputs of these validated models, quantitative variant pathogenicity scores ranging from 0 (benign) to 1 (pathogenic), were further calibrated by calculating the negative predictive value (NPV) and positive predictive value (PPV) using the known pathogenic and benign variants. Evidence weights (i.e., Sherloc points) were assigned to variant pathogenicity scores based on NPV and PPV thresholds. To determine the final classification, variants with this type of functional evidence were subjected to the full variant classification process by clinical genomic scientists and licensed laboratory directors.
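The AUROC validity check described above can be expressed compactly with scikit-learn; the function name is illustrative.

```python
from sklearn.metrics import roc_auc_score

def model_is_valid(scores, labels, threshold=0.8):
    """True when pathogenicity scores separate known pathogenic (1)
    from benign (0) variants with AUROC >= threshold."""
    return bool(roc_auc_score(labels, scores) >= threshold)
```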

Fig. 1

Evidence Modeling Platform. High-throughput assay data and known pathogenic (P) and benign (B) variant labels are used to train a model. The performance of the model is evaluated to determine whether model predictions correlate with known pathogenic and benign variants. Pathogenic (P) or benign (B) Sherloc points are weighted by PPV and NPV calculations, with the PPV/NPV ≥ 0.95 bin, PPV/NPV ≥ 0.8 bin and the PPV/NPV < 0.8 bin receiving 2, 1 or 0 Sherloc points, respectively. MAVE evidence is then incorporated into the larger variant interpretation system within Sherloc, which also includes evidence such as patient phenotype and family segregation

Experimental datasets from 66 genes were evaluated with our machine-learning platform. These datasets were either generated within our functional genomics laboratory (44 genes) or obtained from publications by external groups (22 genes) (Findlay et al. 2018; Richardson et al. 2021; Jia et al. 2021; Glazer et al. 2020; Kato 2003; Giacomelli et al. 2018; Kotler et al. 2018; Weile et al. 2017; Sun et al. 2020; Majithia 2016; Mighell et al. 2018; Matreyek et al. 2018; Brenan et al. 2016; Bandaru et al. 2017; Newberry et al. 2020; Chiasson et al. 2020; Starita et al. 2013; Melamed et al. 2013; Starita et al. 2015; Raraigh et al. 2018; Araya et al. 2012; Amorosi et al. 2021). Evaluating each MAVE dataset was critical because the performance of the resulting predictive models varied widely (Supplementary Fig. 1). Of the 44 datasets produced by our laboratory using single-cell RNA sequencing, 19 yielded a predictive model that both met a performance threshold of AUROC ≥ 0.8 and was selected for integration into Sherloc (Supplementary Table 1). Unlike in cell type identification by unsupervised clustering of scRNA-seq profiles, cells harboring pathogenic variants are often intermixed with cells harboring benign variants; however, analogous clustering at the variant level highlights the signals leveraged by the machine learning models to accurately classify variants (Supplementary Fig. 2). Performant models were achieved for genes across multiple biological pathways and for both loss-of-function and gain-of-function disease mechanisms (Fig. 2). Of the external MAVE datasets from 22 genes evaluated, 5 predictive models (for BRCA1, BRCA2, MSH2, SCN5A, and TP53) met the performance threshold of AUROC ≥ 0.8 (Fig. 3, Supplementary Table 2). The majority of the remaining, unintegrated datasets either lacked enough known pathogenic and benign variants to allow assessment or showed insufficient ability to discriminate between benign and pathogenic variants (AUROC < 0.8), and they were excluded from further assessment in this manuscript (Supplementary Table 3).

Fig. 2

Genes with high-quality internally generated MAVE data. A) The size of the circles correlates with the number of variants utilized in the experiment. Of these, MAX was the smallest experiment with 74 variants, and PTEN was the largest with 274 variants. Circles that are touching represent genes that interact with each other within a biological pathway. B) Histograms of pathogenicity score distributions for selected datasets

Fig. 3

Cellular evidence models derived from published, high-throughput functional datasets. A) Cellular evidence modeling was attempted for published MAVE datasets for 22 genes. Models for 5 genes exhibited sufficient performance (AUROC ≥ 0.8) to be used in variant classification. B) Bar graph illustrating the distribution of prediction quality for the five externally derived cellular evidence models (BRCA1, BRCA2, MSH2, TP53, and SCN5A) that met performance thresholds. Note, the TP53 model is derived from the combination of data from three publications (Kato 2003; Giacomelli et al. 2018; Kotler et al. 2018)

Performance-based integration into a variant classification framework

To ensure that appropriate weight was accorded to the evidence generated by each performant model for the purpose of variant classification, seven tiers were devised based on the NPV and PPV of each model, mirroring existing Sherloc criteria for incorporating functional experimental data (Supplementary Table 4). The first two tiers were defined as [highly predictive benign] and [moderately predictive benign] and were awarded 2 and 1 benign points, respectively. The next two tiers were [moderately predictive pathogenic] and [highly predictive pathogenic] and were awarded 1 and 2 pathogenic points, respectively. The predictive performance thresholds for these tiers were defined as (1) [highly predictive benign] > 95% NPV, (2) [moderately predictive benign] 80–95% NPV, (3) [moderately predictive pathogenic] 80–95% PPV, and (4) [highly predictive pathogenic] > 95% PPV. The fifth tier corresponded to predictions below 80% PPV and below 80% NPV, which were deemed insufficiently certain to be assigned a weight within the Sherloc scoring system. As the TP53 model was both highly predictive and developed from multiple distinct functional readouts (Kato 2003; Giacomelli et al. 2018; Kotler et al. 2018), it was eligible for the final two tiers, [very highly predictive benign, > 97.5% NPV] and [very highly predictive pathogenic, > 97.5% PPV], worth 2.5 benign and 2.5 pathogenic points, respectively.
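The seven tiers can be summarized as a simple lookup. This is a simplified sketch: in practice each model's score range is binned by its empirically calibrated PPV and NPV, and the function name and argument shape are illustrative.

```python
def sherloc_points(ppv, npv, multi_assay=False):
    """Map a prediction's predictive values to Sherloc evidence points.

    Returns (points, direction); direction is "pathogenic", "benign", or
    None when the prediction is insufficiently certain. `multi_assay`
    enables the extra 2.5-point tiers reserved for models built from
    multiple distinct functional readouts (e.g., TP53).
    """
    if multi_assay and ppv > 0.975:
        return 2.5, "pathogenic"   # very highly predictive pathogenic
    if multi_assay and npv > 0.975:
        return 2.5, "benign"       # very highly predictive benign
    if ppv > 0.95:
        return 2.0, "pathogenic"   # highly predictive pathogenic
    if npv > 0.95:
        return 2.0, "benign"       # highly predictive benign
    if ppv >= 0.80:
        return 1.0, "pathogenic"   # moderately predictive pathogenic
    if npv >= 0.80:
        return 1.0, "benign"       # moderately predictive benign
    return 0.0, None               # below 80% PPV and 80% NPV
```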

Variant reclassification and impact

Within the 24 genes for which available datasets yielded a performant predictive model, there were 4043 observed variants with sufficiently confident predictions (≥ 80% NPV or ≥ 80% PPV) to potentially impact Sherloc scoring (Table 1, Supplementary Table 5). To understand the impact of adding this evidence, we reevaluated the subset of variants classified as VUS (n = 3474) for which Sherloc points were applicable as a result of a MAVE-based model. Across genes, we observed an average VUS reclassification rate of 12.6% (436/3474) (Table 2), comprising 127 unique variant upgrades from VUS to likely pathogenic/pathogenic and 309 downgrades from VUS to likely benign/benign. MAVE models also contributed to upgrading 44 variants from likely pathogenic to pathogenic and downgrading 43 from likely benign to benign. As of Q1 2024, 57,096 patient clinical reports contain variants with evidence from these cellular models. Among these impacted reports, 38,614 have variants receiving benign evidence and 19,417 have variants receiving pathogenic evidence, with a small number of reports containing multiple variants receiving evidence.

Table 1 Number of unique variants per gene and corresponding Sherloc weighted scores matching predictive performance tiers
Table 2 Reclassification of variants with the incorporation of evidence from MAVE datasets

Discussion

Historically, genetic testing laboratories have obtained functional experimental data from the scientific literature as one of many types of evidence that contribute to variant classification. In addition to being highly distributed across the literature, these low-throughput assays were generally evaluated in a manner that was not well standardized, was qualitative in nature, and was prone to subjectivity, exacerbating the challenges of leveraging experimental data for variant classification. The advent of MAVE technologies presents new opportunities for genetic testing laboratories both to systematically evaluate and incorporate functional evidence into variant classification and to potentially generate their own experimental data. Genetic testing laboratories have large amounts of clinical data and expertise that can inform experimental design and enable targeting of the genes and variants most likely to benefit from new functional data. To date, however, the translation of these MAVE experiments into large-scale variant classification frameworks, and therefore their impact on patients, has been limited.

This challenge is well suited to a machine learning platform that can build evidence models with MAVE data as features and known pathogenic and benign variants as training labels. Using this approach, we analyzed MAVE datasets for 66 genes and found that models meeting the quality thresholds for use in clinical variant classification could be generated for 24 genes. Many factors likely contributed to the observed performance of each model. For example, many MAVEs have characterized a single region of the gene or one aspect of a gene’s function, whereas pathogenicity is often associated with multiple molecular functions. Additionally, many published datasets included very few clinically evaluated variants with which to gauge performance. Our own experience generating MAVE data indicates that gene expression is not a universally appropriate readout and that the choice of cell type is also an important consideration. Moreover, the experiments described here were conducted under normal growth conditions and without cofactors, which may be insufficient to reveal pathogenic phenotypes. Whatever the underlying cause, the number of models that did not meet our quality thresholds cautions against naively using this evidence in clinical settings and highlights the importance of a rigorous process for quality control.

Combining multiple MAVE datasets using machine learning also has the potential to enhance performance compared with models derived from individual datasets, as this strategy mitigates the limitation highlighted above that a single MAVE dataset may measure only one aspect of gene function. A single model trained on multiple overlapping datasets may therefore achieve improved positive and/or negative predictive values, making it better suited for variant classification. Indeed, this is what we and others (Fayer et al. 2021) have observed for TP53. PTEN represents another opportunity for such combined modeling, as multiple functional datasets exist and additional benign variants have been classified in recent years. Although existing guidelines (Fortuno et al. 2021) for incorporating TP53 MAVE datasets into clinical variant classification also attempt to address these limitations, they are challenging to implement at scale and difficult to envision as a robust strategy for handling the increasing volume of MAVE datasets being released. For example, the guidelines apply only to TP53 and cannot be applied to other genes, and new TP53 MAVE datasets cannot be readily integrated into the existing workflow. As new MAVE datasets continue to be generated both internally and externally, machine learning allows a more streamlined and efficient integration of performant models into clinical variant classification, decreasing the time from data generation to clinical impact. Moreover, models can be continuously evaluated, compared, and updated on a level playing field rather than by subjective assessment.

One distinct aspect of the approach described here is the method of assigning an evidence strength to the prediction scores provided by the MAVE models. In this approach, we determined predictive value bins for each variant score, and the evidence weight assigned to a prediction within the Sherloc classification framework scaled with the predictive value. Consistent with joint ACMG/AMP guidelines, we set the maximum weight of evidence from the MAVE models at 2.5 points, or the equivalent of Strong Evidence described in the ACMG/AMP framework. As a result, additional non-functional evidence is required to reach a definitive likely benign or likely pathogenic classification. In 2019, the ClinGen Sequence Variant Interpretation Working Group further recommended utilizing odds of pathogenicity (OddsPath) as a metric for assigning evidence strength to functional datasets (Brnich et al. 2020). Both our strategy and OddsPath are important quality control measures and seek to assess the performance of the evidence and to scale weighting in variant classification frameworks by the performance. In contrast to the OddsPath recommendations, our approach explicitly allows for multiple evidence weight bins for a given MAVE model. Although reclassification rates are influenced by a number of factors (see below), we have generally observed lower rates upon integrating models built from common datasets. Nevertheless, we have used this strategy for all model types within the Evidence Modeling Platform to significantly reduce VUS rates across a wide diversity of genes without impacting the quality of the classifications (Chen et al. 2023).

With our integration methods, MAVE models enabled reclassification of 12.6% (436/3474) of VUSs observed by our laboratory. Other studies (Fayer et al. 2021; Kim et al. 2020; Scott et al. 2022) have reported a wide range of reclassification rates (0.11–74%) attributable to MAVE-derived evidence. A point to consider when reviewing reclassification rates is that the baseline amount of evidence a clinical laboratory has for a variant will affect the rate of VUS reclassification; this varies from laboratory to laboratory based on a number of factors, including the number of patient samples tested and the sources of functional evidence. The reclassification rate at any given point in time is also an incomplete measure of the impact of MAVE models. For the many VUSs for which reclassification has not yet been achieved, the application of this evidence will lessen the amount of additional evidence required to reach a terminal classification and thereby reduce the time they would otherwise spend as a VUS.

As more healthcare providers utilize genetic testing for diagnostic confirmation and treatment decisions, the continued development of scalable and innovative approaches for resolving VUSs is critical. Although MAVEs represent such an approach and have been commonly used for academic research purposes, they have not yet reached their full potential in diagnostic testing settings. When used with proper safeguards and validation procedures, the benefits of broadly utilizing this class of evidence in clinical variant classification are clear. Indeed, 57,096 patients have a report with variants impacted by the 24 MAVE models integrated here. Importantly, however, the value of MAVE data extends far beyond providing additional evidence for germline variant classification. Insights derived from MAVE experiments can also contribute to a deeper understanding of disease mechanisms and help guide somatic variant classification, selection of therapeutic interventions, and drug development.