Abstract
Variants that disrupt splicing are a frequent but clinically under-ascertained cause of rare disease. Accurate and efficient methods to predict a variant’s impact on splicing are needed to interpret the growing number of variants of unknown significance (VUS) identified by exome and genome sequencing. Here, we present the results of the CAGI6 Splicing VUS challenge, which invited predictions of the splicing impact of 56 clinically ascertained variants whose effect on splicing had been functionally validated. The performance of 12 prediction methods, along with SpliceAI and CADD, was compared on these 56 variants. The maximum accuracy achieved was 82%, reached by two different approaches: one weighting SpliceAI scores by minor allele frequency, and one applying the recently published Splicing Prediction Pipeline (SPiP). SPiP performed optimally in terms of sensitivity, while an ensemble method combining multiple prediction tools and database information exceeded all others for specificity. Several challenge methods equalled or exceeded the performance of SpliceAI, and the ultimate choice of prediction method is likely to depend on experimental or clinical aims. One quarter of the variants were incorrectly predicted by at least 50% of the methods, highlighting the need for further improvement of splicing prediction methods for successful clinical application.
Introduction
The diagnosis of rare disorders has been revolutionised in recent years thanks to the availability and widespread adoption of next-generation sequencing technologies capable of detecting disease-causing variants. With the ever-decreasing prices of whole-exome sequencing (WES) and whole-genome sequencing (WGS) comes an increased use of these approaches, leading to the detection of more genetic variants than ever before. This brings with it a major challenge: our ability to detect variants has far outstripped our ability to meaningfully interpret their effects, particularly outside of protein-coding regions. As a result, even with WGS, around half of patients with rare disorders do not receive a diagnosis (Turro et al. 2020; Stranneheim et al. 2021).
While estimates vary widely (Lord and Baralle 2021), it is thought that somewhere between 15 and 60% of disease-causing variants affect splicing (Krawczak et al. 1992; López-Bigas et al. 2005). Generally speaking, in diagnostic and research variant prioritisation pipelines, variants falling within the 2 bp canonical splice acceptor or donor sites are classed as splice-affecting, while variants outside these small regions are often not assessed for splicing impact at all. Intronic and synonymous variants are commonly filtered out, while missense variants are generally assessed for their impact on protein structure and function without consideration of the role they may play in splicing. All of these variant types, however, can and do impact splicing and cause disease, and this approach has led to an under-ascertainment of splice-affecting variants clinically (Lord et al. 2019). What is needed, particularly as the increasing use of WGS over WES enables the detection of far more intronic variants than before, is a way to effectively triage which variants are splice-affecting and which are not.
Currently, under ACMG/AMP guidelines (Richards et al. 2015), in silico splicing prediction approaches may be used as supporting evidence for genetic diagnosis if multiple independent tools predict an impact on splicing. Experimental validation of splicing effects using RT-PCR, mini-genes or RNAseq is often required to confidently establish a variant’s impact on splicing, but such approaches are time-consuming and expensive to perform at scale. Recent years have seen a plethora of innovative new approaches to splicing prediction, with many new tools being generated, often utilising machine learning. If a high degree of accuracy and reliability can be obtained from in silico approaches, we may be able to move away from requiring experimental confirmations, or at the least, have an efficient method to triage variants most in need of validation. This would require highly accurate algorithms and extensive testing in the clinical setting to give sufficient confidence in these optimal approaches.
The Splicing Variants of Unknown Significance (VUS) challenge in the 6th Critical Assessment of Genome Interpretation (CAGI6) sought to assess splicing prediction accuracy on a set of clinically ascertained, functionally validated variants. This enabled performance comparison of many cutting-edge splicing prediction approaches and gave insights into the types of variants not currently well-captured by these methods.
Methods
Variant selection and validation
As previously described (Wai et al. 2020), a total of 64 variants were ascertained through the Wessex Regional Genetics Laboratory in Salisbury (52 variants) or the Splicing and Disease research study (12 variants) at the University of Southampton, ethically approved by the Health Research Authority (IRAS Project ID 49685, REC 11/SC/0269) and by the University of Southampton (ERGO ID 23056). Informed consent for splicing studies was provided by all patients. All variants had undergone, or were undergoing, RT-PCR analysis to determine their impact on splicing using RNA from whole blood collected in PAXgene tubes, again as previously described (Wai et al. 2020).
Eight variants were excluded from the final analysis: for three, the splicing impact could not be established before the analysis period; three had incorrect gene/variant annotations in the distributed data set; and two were found to impact gene expression rather than splicing. This gave a total of 56 variants in the final assessment set (Supplementary Table 1), spanning a wide range of rare disease and cancer predisposition associations, none of which had previously had their impact on splicing published.
The Splicing VUS challenge
Variants were distributed as a tab delimited text file, including the following information: HGNC identifier, chromosome, position, reference allele, alternative allele, gene and strand. Entrants also had access to 256 previously published variants (Wai et al. 2020) obtained and validated by the same approach to aid in method development/testing.
Challenge participants submitted their entries in the form of tab delimited text files, including the variant information, a binary prediction of whether a variant affected splicing or not (1 = yes, 0 = no), along with a score for the probability of the variant affecting splicing and the level of confidence in the prediction given. All assessments were based on the binary splice-affecting prediction alone.
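A minimal sketch of parsing and sanity-checking a submission in this format might look as follows; the column names used here are hypothetical, and a real submission also carried the full variant fields listed above:

```python
import csv
import io

def load_submission(text):
    """Parse a tab-delimited challenge submission supplied as a string and
    sanity-check the binary prediction and probability fields.
    Column names ("prediction", "probability", "confidence") are illustrative."""
    entries = []
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        pred = int(row["prediction"])
        if pred not in (0, 1):
            raise ValueError(f"binary prediction must be 0 or 1, got {pred}")
        prob = float(row["probability"])
        if not 0.0 <= prob <= 1.0:
            raise ValueError(f"probability out of range: {prob}")
        entries.append({**row, "prediction": pred, "probability": prob})
    return entries
```

Since all assessments were based on the binary prediction alone, downstream analysis only needs the validated `prediction` column.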
Challenge assessment
The performance of each prediction model was assessed by calculating and comparing a series of metrics: overall accuracy, area under the receiver operating characteristic curve (AUC), sensitivity, specificity, positive predictive value (PPV) and negative predictive value (NPV). AUC and confidence intervals (2000 stratified bootstrap replicates) were calculated using the pROC package (Robin et al. 2011) in R v3.5.1 (R Core Team 2018), and plots were made with ggplot2 (Wickham 2009). Performance of each method was compared across binned splicing locations: Near Acceptor (acceptor ± 10 bp), Near Donor (donor ± 10 bp), Exonic Distant (exonic, 11 bp or more from either splice site) and Intronic Distant (intronic, 11 bp or more from either splice site); for grouped analyses, Exonic Distant and Intronic Distant variants were combined due to low numbers. These scores were based on the concordance of the binary classification of the variants provided by each team/model (1 = splice-affecting and 0 = not splice-affecting) with the experimental validation of the splicing impact.
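The confusion-matrix metrics listed above can be sketched as follows; the original analysis used R with pROC, so this Python equivalent is purely illustrative and omits the bootstrapped AUC:

```python
def binary_metrics(y_true, y_pred):
    """Compute challenge assessment metrics from binary calls
    (1 = splice-affecting, 0 = not splice-affecting).
    Assumes both classes are present so no denominator is zero."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
        "ppv": tp / (tp + fp),          # positive predictive value
        "npv": tn / (tn + fn),          # negative predictive value
    }
```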
SpliceAI (Jaganathan et al. 2019) and CADD v1.6 (Kircher et al. 2014) (which incorporates SpliceAI predictions) were included in the assessment alongside the challenge models as a comparison to emerging industry standards. CADD-phred scores were obtained by uploading a VCF to the CADD webserver (https://cadd.gs.washington.edu/score). SpliceAI scores were obtained from Ensembl’s Variant Effect Predictor (VEP) web interface (McLaren et al. 2016) (44 variants scored) or using the SpliceAI webserver from the Broad Institute (https://spliceailookup.broadinstitute.org/, 11 variants that were not scored by VEP; options: hg38, masked scores, max distance 50 bp). A cutoff of 0.2 was used for SpliceAI scores, and 18 for CADD.
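Converting the comparators’ continuous scores into the challenge’s binary calls with the stated cutoffs can be sketched as below; treating a score exactly at the cutoff as splice-affecting is our assumption, as the text does not specify boundary handling:

```python
CUTOFFS = {"SpliceAI": 0.2, "CADD": 18.0}  # cutoffs stated in the text

def to_binary_calls(scores, tool):
    """Map {variant: score} to {variant: 0/1} using the tool's cutoff.
    Scores at or above the cutoff are called splice-affecting (an assumption)."""
    cutoff = CUTOFFS[tool]
    return {variant: int(score >= cutoff) for variant, score in scores.items()}
```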
Results
Variant characteristics of challenge set
Of the 56 variants in the final analysis, the majority (n = 49, 87.5%) were single nucleotide variants (SNVs), with seven indels (12.5%). The variants fell within 42 different genes, broadly representative of clinical genetics referrals in the UK; the majority of genes had a single variant in the set, and only seven genes had more than one variant (BRCA1 n = 6, FBN1 n = 4, MYBPC3 n = 3, BRCA2 n = 2, SCN5A n = 2, APC n = 2, USP7 n = 2). Thirty-seven variants (66%) were found to affect splicing, while 19 (34%) had no observable impact.
Variants were divided into four groups by their positions relative to intron–exon boundaries. There were 16 variants within 10 bp of a splice acceptor site (Near Acceptor) and 23 within 10 bp of a splice donor site (Near Donor). Ten exonic variants >10 bp from either splice site were classed as Exonic Distant. The seven intronic variants >10 bp from their nearest splice site were termed Intronic Distant (six upstream of an acceptor, one downstream of a donor). The locations of all variants relative to the intron–exon boundary, and whether each variant was determined to be splice-disrupting, are given in Fig. 1.
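The location grouping described above can be sketched as follows; the function signature (nearest site, distance in bp, exonic status) is illustrative rather than a description of how the annotation was actually computed:

```python
def location_bin(nearest_site, distance_bp, is_exonic):
    """Bin a variant by its position relative to the intron-exon boundary.
    nearest_site: "acceptor" or "donor"; distance_bp: distance to that site."""
    if abs(distance_bp) <= 10:
        return "Near Acceptor" if nearest_site == "acceptor" else "Near Donor"
    # >10 bp from both splice sites: exonic/intronic status decides the bin
    return "Exonic Distant" if is_exonic else "Intronic Distant"
```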
Challenge participants
Eight teams submitted predictions for the challenge, with two teams submitting predictions from multiple models, giving 12 models altogether. Table 1 gives a summary of the approach taken by each model, which was provided by the challenge entrants upon submission of their predictions, but blinded to the assessors until after the assessment period.
Model performance across 56 variants
Table 2 summarises the performance metrics of the 12 models, along with CADD and SpliceAI. Full variant information, scores and binary predictions for the 12 models, SpliceAI and CADD and experimental outcome of splicing status are given in Supplementary Table 1. The ROC plots for each model are shown in Fig. 2, and Supplementary Fig. 1 shows the performance of each method on each variant across the splicing region.
No single approach performed optimally on all assessment metrics (Table 2). Overall accuracy was joint highest for Teams 4 and 8 at 0.82, with Team 4 also achieving the highest binary outcome AUC at 0.839 (Fig. 2). Team 8 ranked highest on the related metrics of sensitivity (0.919) and NPV (0.800), reflecting its permissive prediction approach (i.e., favouring sensitivity over specificity). Conversely, Team 5’s Model 2 performed best in terms of specificity (0.947) and PPV (0.947), with the lowest proportion of false positive findings. All three models from Team 1, plus Teams 4 and 6, achieved over 70% in both sensitivity and specificity, indicating more balanced performance.
Included as comparators were SpliceAI with a cutoff of 0.2 and CADD with a cutoff of 18. SpliceAI was competitive with the challenge entrants, ranking near the top, though not top, on all metrics, and achieving the highest AUC when measured using prediction scores rather than binary prediction outcomes. CADD, however, performed poorly on the challenge set, with specificity in particular being very low (0.263).
Performance comparison by variant type
To get an overall impression of the performance of the methods on different types of variants, variants were grouped by location relative to their nearest splice site (Fig. 3), as described in Methods. All methods performed better on exonic distant variants than intronic distant variants, with the exception of SpliceAI, which correctly predicted all seven intronic distant variants. Across methods, there was a high degree of consistency in the proportion of variants correctly predicted in the near acceptor region, and a high degree of variance in performance in the intronic distant set. The types of error differed across regions, with the near acceptor region and exonic distant region having very few false positive predictions across all methods, while almost all methods gave false positive predictions in the near donor and intronic distant regions (Supplementary Fig. 2).
We also compared the performance of the approaches on SNVs vs indels, and found all methods except CADD had higher accuracy on SNVs than indels (Supplementary Fig. 3).
Some variants are consistently mispredicted
Twenty-one of the variants (37.5%) were correctly predicted by all 12 submitted prediction methods. None of the variants was incorrectly predicted by all methods, but 14 variants (25%) were predicted correctly by ≤50% of the methods, with two variants correctly predicted by only a single method each. These were a splice-affecting single nucleotide deletion 4 bp from a splice acceptor site in KANSL1 (correctly predicted by Team 3) and an SNV in the last base of an exon in TRPM6 which, despite altering the conserved final G nucleotide, did not affect splicing in functional testing (correctly predicted by Team 4; Fig. 4).
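Identifying consistently mispredicted variants amounts to tallying, per variant, how many methods matched the experimental outcome; a sketch with illustrative names:

```python
def correct_prediction_counts(truth, predictions):
    """Count how many methods called each variant correctly.
    truth: {variant: 0/1 experimental outcome};
    predictions: {method: {variant: 0/1 binary call}}."""
    return {
        variant: sum(1 for calls in predictions.values()
                     if calls.get(variant) == outcome)
        for variant, outcome in truth.items()
    }
```

Variants whose count is at most half the number of methods correspond to the consistently mispredicted set described above.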
Discussion
The CAGI6 Splicing VUS challenge assessed the performance of 14 prediction approaches on a set of 56 clinically relevant variants whose impact on splicing had been functionally tested using RT-PCR. A variety of approaches were adopted, and several methods equalled or exceeded the performance of the emergent field leader, SpliceAI.
While Teams 4 and 8 had joint highest overall accuracy, there was no single optimal method for the Splicing VUS challenge, since different models performed optimally on different metrics. The choice of approach may, therefore, depend on the specific nature of the predictions required. Seeking a molecular diagnosis for a particular family may favour sensitivity over specificity, since an overlooked causal variant would preclude a diagnosis, so Team 8’s approach with almost 92% sensitivity may be preferred. Seeking confident splice-disrupting candidates for functional validation or mechanistic research may call for greater specificity than sensitivity, to avoid wasting resources on false positive variants with no actual impact, in which case Team 5’s Model 2 with almost 95% specificity may be the strategy of choice.
SpliceAI and CADD v1.6 were chosen as comparators for the entrants to the Splicing VUS challenge and were run by the assessors on the 56 challenge variants. SpliceAI has been emerging as a field leader in recent years, with accuracies >90% attained in several studies (Wai et al. 2020; Ha et al. 2021; Strauch et al. 2022), although more variable performance has been reported by others (Riepe et al. 2020), which is more consistent with the 80.4% overall accuracy we observed in this study.
CADD did not perform well on the challenge variants, achieving an overall accuracy of 62.5%. However, this was predominantly driven by a very low specificity, which is to be expected from CADD, since the tool assesses overall deleteriousness rather than splicing impact alone. For example, missense variants which were not found to affect splicing in the challenge set may still have been pathogenic through an impact on protein structure and/or function. CADD would accurately classify such variants as deleterious in general, but in our assessment solely of splicing impact, this would appear as a false positive, lowering CADD’s specificity. Notably, the version of CADD included in the assessment (v1.6) incorporates SpliceAI and additional splicing prediction tools in its underlying model (Rentzsch et al. 2021). Scoring the challenge variants with CADD v1.5, which did not include these splicing metrics, resulted in an overall accuracy of around 44.6% (data not shown). From these values, it is clear that the explicit inclusion of splicing prediction methods within CADD’s underlying model has improved its ability to predict variants that impact splicing. CADD’s broad approach makes it a versatile tool for predicting deleteriousness across many different variant types. At present, however, if predicting a variant’s splicing impact is the sole aim, the use of more specialised splicing tools is more appropriate.
Of note, SpliceAI featured heavily across the predictive strategies, being the sole predictive method for Team 6 and contributing heavily to the predictions of Team 4, which were weighted by MAF, as well as being run as a comparator by the assessors. Differences in the performance of these approaches highlight that even with the same nominal method, there can be variance in predictions depending on how the scores are obtained, and the thresholds that are used to determine positive predictions. There were just three approaches that did not include SpliceAI as part of their predictions, two utilising instead recent machine learning-based prediction tools SQUIRLS (Danis et al. 2021) and SPiP (Leman et al. 2022), and one based on the splicing prediction tools available within the Alamut software, which has been widely used in clinical practice. Of the three, SPiP was the only method to achieve greater accuracy than SpliceAI.
A major strength of the challenge in terms of providing a real-world assessment of these tools is the ascertainment of the challenge variants from genuine clinical practice, where potential splice-altering variants in genes relevant to the patient’s presentation were referred for validation. This is precisely the type of variant splicing prediction models should be tested on when assessing their suitability for clinical application in rare disorders. It highlights that even in exceptionally well-studied genes, such as the BRCA genes, challenges in variant interpretation remain: three of the eight variants across BRCA1 and BRCA2 were incorrectly predicted by over half of the challenge methods, and only two were accurately predicted by all methods. However, the relatively small sample size makes it difficult to draw any major inferences and is a significant limitation of the study. Apparent variance in performance may be stochastic at such a sample size and may not be fully reflective of overall performance in a wider context. It also made drawing firm conclusions about performance in subsets of the data (e.g., split by location, variant type or disease group) challenging. However, ascertaining a large body of clinical variants, validating their splicing impact, and keeping the results private, as is required for a blinded challenge such as the CAGI6 Splicing VUS challenge, raises ethical concerns. Accurate and timely variant interpretation relies on data sharing, and withholding a large body of functionally validated variants from resources such as ClinVar (Landrum et al. 2018), which are heavily used in clinical assessment of variants, does not represent good practice.
This small but highly clinically relevant challenge assessed the performance of 12 prediction methods, plus SpliceAI and CADD, on 56 clinically ascertained variants, and found SpliceAI weighted by allele frequency and SPiP to be the most accurate overall, while other methods had particular strengths in sensitivity or specificity. A quarter of variants were incorrectly predicted by half or more of the methods, showing that there is still room for improvement. Furthermore, this challenge was limited to a binary outcome (whether or not splicing was disrupted) and did not address the nature of that disruption. Disruption to splicing is often complex (e.g., multiple different splicing events induced), incomplete (e.g., aberrant and wild-type splicing observed together), and can be further complicated by nonsense-mediated decay. Predicting these outcomes will present an even greater challenge than the binary outcome assessed here. A larger assessment set enabling further investigation of the types of variants that are consistently incorrectly predicted may help direct efforts to refine models moving forwards.
Data availability
All data generated or analysed during this study are included in this published article and its supplementary information files.
References
Cheng J et al (2019) MMSplice: modular modeling improves the predictions of genetic variant effects on splicing. Genome Biol 20(1):48
Danis D, Jacobsen JOB, Carmody LC, Gargano MA, McMurry JA, Hegde A, Haendel MA, Valentini G, Smedley D, Robinson PN (2021) Interpretable prioritization of splice variants in diagnostic next-generation sequencing. Am J Hum Genet 108(9):1564–1577
Ha C, Kim JW, Jang JH (2021) Performance evaluation of SpliceAI for the prediction of splicing of NF1 variants. Genes (basel) 12:1308
Jagadeesh KA et al (2019) S-CAP extends pathogenicity prediction to genetic variants that affect RNA splicing. Nat Genet 51(4):755–763
Jaganathan K, Kyriazopoulou Panagiotopoulou S, McRae JF, Darbandi SF, Knowles D, Li YI, Kosmicki JA, Arbelaez J, Cui W, Schwartz GB et al (2019) Predicting splicing from primary sequence with deep learning. Cell 176(3):535–548
Jian X, Boerwinkle E, Liu X (2014) In silico prediction of splice-altering single nucleotide variants in the human genome. Nucleic Acids Res 42(22):13534–13544
Karczewski KJ et al (2020) The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581(7809):434–443
Kircher M, Witten DM, Jain P, O’Roak BJ, Cooper GM, Shendure J (2014) A general framework for estimating the relative pathogenicity of human genetic variants. Nat Genet 46(3):310–315
Krawczak M, Reiss J, Cooper DN (1992) The mutational spectrum of single base-pair substitutions in mRNA splice junctions of human genes: causes and consequences. Hum Genet 90:41–54
Landrum MJ, Lee JM, Benson M, Brown GR, Chao C, Chitipiralla S, Gu B, Hart J, Hoffman D, Jang W et al (2018) ClinVar: improving access to variant interpretations and supporting evidence. Nucleic Acids Res 46(D1):D1062–D1067
Leman R, Parfait B, Vidaud D, Girodon E, Pacot L, Le Gac G, Ka C, Ferec C, Fichou Y, Quesnelle C et al (2022) SPiP: Splicing Prediction Pipeline, a machine learning tool for massive detection of exonic and intronic variant effects on mRNA splicing. Hum Mutat 43(12):2308–2323
López-Bigas N, Audit B, Ouzounis C, Parra G, Guigó R (2005) Are splicing mutations the most frequent cause of hereditary disease? FEBS Lett 579:1900–1903
Lord J, Baralle D (2021) Splicing in the diagnosis of rare disease: advances and challenges. Front Genet 12:689892
Lord J, Gallone G, Short PJ, McRae JF, Ironfield H, Wynn EH, Gerety SS, He L, Kerr B, Johnson DS et al (2019) Pathogenicity and selective constraint on variation near splice sites. Genome Res 29:159–170
McLaren W, Gil L, Hunt SE, Riat HS, Ritchie GR, Thormann A, Flicek P, Cunningham F (2016) The ensembl variant effect predictor. Genome Biol 17(1):122
R Core Team (2018) A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna
Rentzsch P, Schubach M, Shendure J, Kircher M (2021) CADD-Splice-improving genome-wide variant effect prediction using deep learning-derived splice scores. Genome Med 13(1):31
Richards S, Aziz N, Bale S, Bick D, Das S, Gastier-Foster J, Grody WW, Hegde M, Lyon E, Spector E et al (2015) Standards and guidelines for the interpretation of sequence variants: a joint consensus recommendation of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology. Genet Med 17:405–424
Riepe TV, Khan M, Roosing S, Cremers FPM, ‘t Hoen PAC (2020) Benchmarking deep learning splice prediction tools using functional splice assays. Authorea 42:799–810. https://doi.org/10.22541/au.160081230.07101269
Robin X, Turck N, Hainard A, Tiberti N, Lisacek F, Sanchez JC, Muller M (2011) pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics 12:77
Stranneheim H, Lagerstedt-Robinson K, Magnusson M, Kvarnung M, Nilsson D, Lesko N, Engvall M, Anderlid BM, Arnell H, Johansson CB et al (2021) Integration of whole genome sequencing into a healthcare setting: high diagnostic rates across multiple clinical entities in 3219 rare disease patients. Genome Med 13:40
Strauch Y, Lord J, Niranjan M, Baralle D (2022) CI-SpliceAI-Improving machine learning predictions of disease causing splicing variants using curated alternative splice sites. PLoS ONE 17:e0269159
Turro E, Astle WJ, Megy K, Graf S, Greene D, Shamardina O, Allen HL, Sanchis-Juan A, Frontini M, Thys C et al (2020) Whole-genome sequencing of patients with rare diseases in a national health system. Nature 583:96–102
Wai HA, Lord J, Lyon M, Gunning A, Kelly H, Cibin P, Seaby EG, Spiers-Fitzgerald K, Lye J, Ellard S et al (2020) Blood RNA analysis can increase clinical diagnostic rate and resolve variants of uncertain significance. Genet Med 22:1005–1014
Wickham H (2009) ggplot2: elegant graphics for data analysis. Use R! series. Springer, New York. https://doi.org/10.1007/978-0-387-98141-3_1
Yeo G, Burge CB (2004) Maximum entropy modeling of short sequence motifs with applications to RNA splicing signals. J Comput Biol 11(2–3):377–394
Acknowledgements
We thank the CAGI organisers for their commitment to improving variant interpretation and for making this challenge happen. The CAGI experiment is supported by NIH U24 HG007346. We acknowledge the NIHR Clinical Research Network (CRN) in recruiting the participants and the Musketeers Memorandum, as well as support from the NIHR UK Rare Genetic Disease Consortium. The authors acknowledge the use of the IRIDIS High Performance Computing Facility, and associated support services at the University of Southampton, in the completion of this work. For the purpose of open access, the author has applied a CC BY public copyright licence to any Author Accepted Manuscript version arising from this submission.
Funding
The Baralle Lab is supported by the NIHR Research Professorship awarded to DB (RP-2016-07-011). JL is supported by an Anniversary Fellowship from the University of Southampton. Some of the functional validations of variants were funded by a Wessex Medical Research Innovation Grant awarded to JL. RDB is supported by a New South Wales Health Cardiovascular Disease Senior Scientist Grant.
Author information
Authors and Affiliations
Contributions
DB and JL conceived of the challenge. AGLD, DJB and JL selected variants to include in the set, which had been functionally validated by HAW and DJB. JL assessed challenge entrants and conducted data analysis. CJO conducted additional analyses and presented the findings at the CAGI6 conference. All further authors submitted prediction methods in response to the challenge. JL drafted the manuscript, with revision suggestions and final approval from all other authors.
Corresponding author
Ethics declarations
Competing interests
The authors have no relevant financial or non-financial interests to disclose. On behalf of all authors, the corresponding author states that there is no conflict of interest.
Ethics approval
Informed consent was provided for all patients for splicing studies to be conducted. Patients were recruited from Wessex Regional Genetics Laboratory in Salisbury (52 variants) or the Splicing and Disease research study (12 variants) at the University of Southampton, ethically approved by the Health Research Authority (IRAS Project ID 49685, REC 11/SC/0269) and by the University of Southampton (ERGO ID 23056).
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Lord, J., Oquendo, C.J., Wai, H.A. et al. Predicting the impact of rare variants on RNA splicing in CAGI6. Hum. Genet. (2024). https://doi.org/10.1007/s00439-023-02624-3