Abstract
Rhabdophis nuchalis, a snake widely distributed in China, possesses a unique trait: glands beneath the skin on its neck and back, known as nucho-dorsal glands. These features make it a valuable subject for studying genetic diversity and the evolution of complex traits. In this study, we obtained a high-quality chromosome-level reference genome of R. nuchalis using MGI short-read sequencing, PacBio Revio long-read sequencing, and Hi-C sequencing techniques. The final assembly comprised 1.92 Gb of the R. nuchalis genome, anchored to 20 chromosomes (including 9 macrochromosomes and 11 microchromosomes), with a contig N50 of 104.79 Mb, a scaffold N50 of 204.96 Mb, and a BUSCO completeness of 97.50%. Additionally, we annotated a total of 1.09 Gb of repetitive sequences (which constitute 56.51% of the entire genome) and identified 22,057 protein-coding genes. This high-quality reference genome of R. nuchalis furnishes essential genomic data for comprehending the genetic diversity and evolutionary history of the species, as well as for facilitating species conservation efforts and comparative genomics studies.
Background & Summary
Recent studies indicate that snakes gradually evolved from lizards during the Early Cretaceous period, approximately 117.68 million years ago1. According to the most recent entries in the Reptile Database (https://reptile-database.reptarium.cz/), there are over four thousand snake species distributed across all continents except Antarctica, occupying diverse ecological niches and demonstrating high species diversity2. This broad distribution and adaptation to various habitats make snakes a vital component of Earth’s biodiversity3. Furthermore, certain snakes have developed distinct characteristics through evolution. For instance, Viperidae and Elapidae snakes exhibit high venom potency4. Snakes within the subfamily Hydrophiinae have adapted to sea life5, while those in the Typhlopidae are adept at living in soil6. Consequently, snakes represent an irreplaceable subject for biodiversity and adaptive evolutionary research. In recent years, high-quality chromosome-level genomes of several snake species have been published, offering valuable insights into unique traits and snake evolution1,7,8,9,10. However, despite these advancements, there remains a significant dearth of available data on snakes, both in terms of quantity and quality of reference genomes, hindering further research in this field.
Malnate (1960) partitioned Natrix sensu lato based on several morphological characters and restored the genus Rhabdophis11, which was established by Fitzinger in 1843 with R. subminiatus as the type species. Unlike other snakes, members of the genus Rhabdophis possess glands beneath the skin of the neck and back, referred to as nuchal and dorsal glands, respectively12. In some species, these glands are confined to the neck (e.g., R. tigrinus)13. The glands contain potent cardiotonic steroids known as bufadienolides (BDs), which serve as defensive toxins against predators13,14. According to the latest records from the Reptile Database, there are currently 34 known species of the genus Rhabdophis worldwide. However, to date, no reference genome is available for any species in the genus, which poses a challenge for genomic studies of these species.
In 1891, Boulenger described Tropidonotus nuchalis15 based on a specimen from Hubei, China, subsequently classified as Natrix nuchalis16 by Parker in 1925 and revised by Malnate in 1960 as R. nuchalis11. This species is known by the common name Hubei keelback and is widely distributed in China17. Its diet primarily consists of earthworms and firefly larvae. Notably, R. nuchalis acquires BDs from firefly larvae and stores them in its dorsal neck glands, making it an ideal candidate for studying genetic diversity and complex trait evolution14. Current research on R. nuchalis focuses primarily on morphology18, phylogenetic relationships17, and biogeography19, and the absence of genomic data has hindered further exploration.
In this study, we successfully assembled and annotated the genome of R. nuchalis at the chromosome level by MGI short-read sequencing, PacBio Revio long-read sequencing, Hi-C20 sequencing, and RNA sequencing (RNA-seq) techniques. We estimated genome size and heterozygosity from clean short reads, performed long-read sequencing using the PacBio Revio System, and combined it with Hi-C20 reads to achieve chromosome-level assembly. Genome annotation was conducted using RNA-seq reads from five tissues (heart, spleen, lung, kidney, and muscle), published genomes of closely related species, and de novo prediction methods. Additionally, we assessed the quality of genome assembly using various metrics. Our efforts culminated in the first high-quality reference genome of the genus Rhabdophis, providing essential genetic data for studying adaptive evolution, genetic diversity, and resequencing analysis of R. nuchalis and the broader genus Rhabdophis.
Methods
Ethics statement
This study adhered to all pertinent ethical and legal guidelines and regulations. The collection of animals and extraction of tissues underwent thorough review and received approval from the Animal Ethics and Welfare Committee of Sichuan Agricultural University (Approval No. 20230121).
Sample collection
An adult female R. nuchalis (body length of 755 mm) was collected from the Shennongjia forest area (latitude: 31.683625, longitude: 110.418075) in Hubei Province, China, for genome sequencing and assembly. Six different tissues (heart, liver, spleen, lung, kidney, and muscle) were sequentially collected and rapidly frozen in liquid nitrogen upon collection, then stored at −80 °C. Liver tissue was utilized for MGI short-read sequencing, PacBio Revio HiFi long-read sequencing, and Hi-C sequencing, while the remaining five tissues were designated for RNA sequencing.
Library construction and sequencing
The collected tissues were sent to GrandOmics Biosciences Co., Ltd. (Wuhan, China), for DNA extraction, library construction, and sequencing. Genomic DNA (gDNA) was extracted from the liver following the manufacturer’s instructions and used for the construction of gDNA libraries. The integrity and purity of the gDNA samples were assessed using agarose gel electrophoresis.
For short-read sequencing, 1.5 μg gDNA was randomly fragmented by Covaris, following the guidelines specified in the device’s operating manual, and 300–400 bp fragments were selected with the Agencourt AMPure XP-Medium kit. The library was then constructed from the selected fragments using the AxyPrep Mag PCR clean-up Kit according to the manufacturer’s instructions. Finally, the qualified libraries were sequenced on the BGISEQ DNBSEQ-T7 platform. This yielded 108.82 Gb of raw reads, and 101.02 Gb of clean reads (with an average depth of coverage of 38.95×) were obtained after quality control using fastp v0.21.021 (Table 1). These clean reads were utilized for genome size estimation and to evaluate the accuracy of genome assembly.
For PacBio HiFi long-read sequencing, 5 µg of gDNA was used to construct SMRTbell libraries following PacBio’s standard protocol (Pacific Biosciences, CA, USA). The process included shearing of gDNA using g-TUBEs (Covaris, USA) according to the expected fragment size for the library, DNA damage repair, end repair, and A-tailing, followed by ligation of hairpin adapters to both ends of the fragments using the SMRTbell Express Template Prep Kit 3.0 (Pacific Biosciences). After nuclease treatment of the SMRTbell library using the SMRTbell Enzyme Cleanup Kit, target fragments were size-selected using PippinHT (Sage Science, USA), and the prepared SMRTbell library was sequenced on the PacBio Revio platform with the Revio Kit at GrandOmics. This resulted in 79.31 Gb of HiFi long reads (with an average coverage depth of 39.65×) for genome assembly (Table 1).
Hi-C libraries were constructed following a published protocol22 to scaffold the genome to the chromosome level. Key steps included fixation of liver samples using 2% formaldehyde, digestion with the DpnII enzyme, end repair, biotin-14-dCTP labeling, ligation with T4 DNA ligase, reversal of cross-links, and fragmentation of the DNA. Subsequently, the libraries were sequenced on the BGISEQ DNBSEQ-T7 platform. This generated 209.72 Gb of raw reads, and 209.72 Gb of clean reads (with an average depth of coverage of 107.70×) were obtained after quality control using fastp v0.21.021 (Table 1).
To improve the precision of genome annotation, RNA sequencing was conducted across five distinct tissues: heart, spleen, lungs, kidneys, and muscles. Each tissue underwent RNA extraction utilizing TRIzol reagent (Invitrogen, USA), followed by assessment of RNA purity and concentration using Nanodrop and Qubit, construction of RNA-seq libraries employing the MGIEasy RNA Sample Prep Kit (UW Genetics), and sequencing on the BGISEQ DNBSEQ-T7 platform. A minimum of 6 Gb of sequencing data was guaranteed for each tissue. In total, 40.26 Gb of raw reads were generated, with 40.14 Gb of clean reads obtained post quality control using fastp v0.21.021 (Table 1). These clean reads were utilized for transcriptome annotation of the genome.
Predicting genome size and heterozygosity
The genome size and heterozygosity of R. nuchalis were predicted using KMC v3.2.123 and GenomeScope v124 software before assembly. Initially, the short reads, post-quality control, underwent analysis with KMC v3.2.123 (parameter k = 17) to generate the k-mer frequency distribution table. Subsequently, the obtained k-mer frequency distribution table was analyzed using GenomeScope v124 software to derive genome prediction information. Finally, the prediction results indicated a genome size of 1.57 Gb and a heterozygosity of 1.20% (Table 2).
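As a rough illustration of the k-mer approach, the sketch below estimates genome size from a k-mer depth histogram by dividing the total number of k-mer occurrences by the depth of the main peak. This is a simplified stand-in, not the model that GenomeScope fits. The function name, the `min_depth` error cutoff, and the toy histogram are assumptions made for illustration only.

```python
from typing import Dict


def estimate_genome_size(histogram: Dict[int, int], min_depth: int = 5) -> float:
    """Estimate genome size (bp) from a k-mer depth histogram.

    histogram maps k-mer depth -> number of distinct k-mers seen at that depth.
    Bins below min_depth are treated as sequencing errors and ignored
    (the cutoff of 5 is an arbitrary illustrative choice).
    """
    solid = {depth: count for depth, count in histogram.items() if depth >= min_depth}
    if not solid:
        raise ValueError("no k-mers at or above min_depth")
    # The depth holding the most distinct k-mers is taken as the main peak.
    peak_depth = max(solid, key=solid.get)
    # Total k-mer occurrences = sum of depth * count over the retained bins.
    total_kmers = sum(depth * count for depth, count in solid.items())
    return total_kmers / peak_depth


# Toy histogram (invented values): an error tail at depths 1-2 and a peak at 20.
toy_histogram = {1: 900, 2: 50, 18: 100, 19: 400, 20: 1000, 21: 400, 22: 100}
print(estimate_genome_size(toy_histogram))  # 40000 occurrences / depth 20 = 2000.0
```

For a genome with appreciable heterozygosity, the histogram usually shows a second peak near half the main depth, which is one reason a fitted model is preferred over simply picking the tallest bin.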
De novo assembly of the R. nuchalis genome
De novo assembly of the R. nuchalis genome was conducted using the obtained HiFi long reads through hifiasm v0.16.025. We acquired the preliminary assembled genome, which underwent comparison with the NT (Nucleotide Sequence Database) library. Sequences longer than 1 Mb were subjected to 50 kb cuts, and contaminating reads (non-target macroclasses, mitochondria) were subsequently removed from the genome to yield the final assembly. The resulting genome size of R. nuchalis, post-contamination removal, was 1.93 Gb, with a contig N50 of 104.79 Mb (Table 3).
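The contig N50 reported above is the length of the contig at which the cumulative length, taken from the longest contig downward, first reaches half of the total assembly length. A minimal sketch of that calculation follows; the function name and the toy contig lengths are illustrative and are not taken from the assembly.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of sequence lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one length is required")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first sequence that pushes the running sum to half the total is the N50.
        if running >= half_total:
            return length
    return ordered[-1]  # defensive fallback; the loop always returns for positive lengths


# Toy contig lengths (invented): total 300, half 150, reached at the second contig.
print(n50([100, 80, 60, 40, 20]))  # 80
```

The same function applies to scaffold lengths to obtain a scaffold N50.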
To assess the quality of the genome assembly, we first employed BUSCO v4.0.526 (Benchmarking Universal Single-Copy Orthologs) to evaluate completeness. This involved analyzing single-copy homologous genes in the OrthoDB database vertebrata_odb10. The analysis revealed that 3,270 (97.50%) out of 3,354 BUSCO groups were identified as complete, including 3,232 complete and single-copy BUSCOs (96.36%), and 38 complete and duplicated BUSCOs (1.13%), indicating high completeness of the assembled genome (Table 4).
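The BUSCO percentages follow directly from the counts given above. The short check below recomputes them; the helper name `percent` is ours, and the counts are those reported in the text.

```python
def percent(count: int, total: int) -> float:
    """Return count as a percentage of total, rounded to two decimals."""
    return round(100 * count / total, 2)


TOTAL_BUSCO_GROUPS = 3354   # vertebrata_odb10 groups, as stated in the text
COMPLETE = 3270
SINGLE_COPY = 3232
DUPLICATED = 38

# Complete BUSCOs are the sum of single-copy and duplicated ones.
assert SINGLE_COPY + DUPLICATED == COMPLETE

print(percent(COMPLETE, TOTAL_BUSCO_GROUPS))     # 97.5
print(percent(SINGLE_COPY, TOTAL_BUSCO_GROUPS))  # 96.36
print(percent(DUPLICATED, TOTAL_BUSCO_GROUPS))   # 1.13
```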
Furthermore, to evaluate the accuracy of the assembly, clean short reads and HiFi long reads were mapped to the R. nuchalis genome using BWA v0.7.1527 and minimap228, respectively. The results indicated that at a coverage depth of 1×, the clean short reads and HiFi long reads achieved 98.24% and 99.97% coverage across the entire genome, respectively (Table 5). This demonstrates the high accuracy of the genome assembly.
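Coverage at a depth of 1× is the fraction of genome positions covered by at least one mapped read. A minimal sketch of that calculation is given below, assuming per-base depths are already available as a list (how they are extracted from the alignments is not shown). The function name and the toy depths are illustrative assumptions.

```python
from typing import Sequence


def breadth_of_coverage(depths: Sequence[int], min_depth: int = 1) -> float:
    """Return the percentage of positions with depth >= min_depth."""
    if not depths:
        raise ValueError("depths must not be empty")
    covered = sum(1 for depth in depths if depth >= min_depth)
    return 100.0 * covered / len(depths)


# Toy per-base depths (invented): 6 of 8 positions have at least one read.
toy_depths = [0, 3, 5, 1, 0, 2, 7, 4]
print(breadth_of_coverage(toy_depths))  # 75.0
```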
Hi-C assisted assembly
We used a multi-step process to assemble the genome of R. nuchalis to the chromosome level using quality-filtered Hi-C reads. First, clean Hi-C reads were aligned to the contig-level assembly generated from HiFi long reads using bowtie2 v2.3.229 to obtain uniquely mapped paired-end reads. HiC-Pro v2.8.130 was then used to identify and retain valid interacting paired-end reads from these uniquely mapped pairs while filtering out invalid pairs such as dangling-end, self-circle, re-ligation, and dumped products.
The scaffolds were then clustered, ordered, and assigned to chromosomes using LACHESIS v131, and the result was manually adjusted in Juicebox v1.11.0832 to derive the final pseudochromosomes. The chromosomes, GC content, gene density, abundance of repetitive sequences, and ncRNA distribution of the genome were visualized using the advanced circos33 in TBtools II34 (Fig. 1B). The assembly comprised 20 chromosomes, consisting of 9 macrochromosomes and 11 microchromosomes (using the 50 Mb threshold applied to squamates35). Chromosome sizes ranged from 14.96 Mb to 411.07 Mb, giving a total genome size of 1.92 Gb (Tables 3, 6, and Fig. 1). The contig N50 was 104.79 Mb and the scaffold N50 was 204.96 Mb (Table 3).
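The macrochromosome/microchromosome split is a simple length threshold. The sketch below applies a 50 Mb cutoff to a mapping of chromosome names to lengths. The chromosome names and the two intermediate lengths are hypothetical, only the extremes (411.07 Mb and 14.96 Mb) come from the text, and treating exactly 50 Mb as a macrochromosome is an assumption.

```python
from typing import Dict, List, Tuple


def classify_chromosomes(
    lengths_mb: Dict[str, float], threshold_mb: float = 50.0
) -> Tuple[List[str], List[str]]:
    """Split chromosomes into (macro, micro) lists by length in Mb."""
    macro = [name for name, length in lengths_mb.items() if length >= threshold_mb]
    micro = [name for name, length in lengths_mb.items() if length < threshold_mb]
    return macro, micro


# Hypothetical names; only the largest and smallest lengths are reported values.
toy_lengths = {"chrA": 411.07, "chrB": 120.0, "chrC": 49.9, "chrD": 14.96}
macro, micro = classify_chromosomes(toy_lengths)
print(macro)  # ['chrA', 'chrB']
print(micro)  # ['chrC', 'chrD']
```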
Repeat sequence annotation
Repeat sequences, comprising tandem repeats (TRs) and transposable elements (TEs), were annotated in the genome of R. nuchalis using a combination of software tools and databases. For TRs, we used GMATA v2.236 and Tandem Repeats Finder (TRF v4.07b37). GMATA v2.236 identified simple sequence repeats (SSRs), while TRF v4.07b37 identified all tandem repeats in the genome. For TEs, de novo and homology-based annotation were combined. First, transposable elements were annotated de novo using MITE-hunter38 and RepeatModeler v1.0.1139, together with LTR_FINDER40, LTR_harvest41 and LTR_retriever42 for the detection of repeat sequences. The resulting libraries were then compared with the TEclass Repbase database to classify each repeat family using TEclass v2.1.343. RepeatMasker v1.33144 was then used to search for both known and novel TEs by mapping sequences from the de novo repeat libraries and Repbase repeat libraries to the genome. Overlapping transposable elements belonging to the same repeat class were sorted and merged.
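The final step, combining overlapping annotations of the same repeat class, amounts to a per-class interval merge. The sketch below shows one common way to do this, assuming annotations on a single sequence given as (class, start, end) tuples with inclusive coordinates. It illustrates the general technique and is not the procedure used by any of the tools named above; the class labels and coordinates are invented.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Interval = Tuple[int, int]


def merge_by_class(annotations: List[Tuple[str, int, int]]) -> Dict[str, List[Interval]]:
    """Merge overlapping intervals separately within each repeat class."""
    grouped: Dict[str, List[Interval]] = defaultdict(list)
    for repeat_class, start, end in annotations:
        grouped[repeat_class].append((start, end))

    merged: Dict[str, List[Interval]] = {}
    for repeat_class, spans in grouped.items():
        spans.sort()  # sort by start, then end
        result = [spans[0]]
        for start, end in spans[1:]:
            last_start, last_end = result[-1]
            if start <= last_end:
                # Overlaps the previous interval: extend it.
                result[-1] = (last_start, max(last_end, end))
            else:
                result.append((start, end))
        merged[repeat_class] = result
    return merged


# Toy annotations (invented): two overlapping LTR hits are merged,
# while the overlapping DNA hit is kept separate because its class differs.
toy = [("LTR", 100, 200), ("LTR", 150, 300), ("DNA", 180, 250), ("LTR", 400, 500)]
print(merge_by_class(toy))
# {'LTR': [(100, 300), (400, 500)], 'DNA': [(180, 250)]}
```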
The results indicated that a total of 1.09 Gb of repetitive sequences were annotated in the genome of R. nuchalis, constituting 56.51% of the entire genome. Among these, TRs and TEs accounted for 13.78 Mb and 885.68 Mb, representing 0.72% and 46.02% of the whole genome, respectively. Class I and Class II TEs comprised 628.50 Mb and 257.18 Mb, accounting for 32.66% and 13.36% of the entire genome, respectively (Table 7). This annotation provides an overview of the repetitive landscape of the R. nuchalis genome.
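The Class I and Class II figures can be checked against the TE total, and the genome size implied by the repeat fraction can be recovered from the reported values. The sketch below performs this arithmetic; the variable and function names are ours, and the numbers are those given in the text.

```python
CLASS_I_MB = 628.50
CLASS_II_MB = 257.18
TE_TOTAL_MB = 885.68

CLASS_I_PCT = 32.66
CLASS_II_PCT = 13.36
TE_TOTAL_PCT = 46.02

# The two classes add up to the reported TE total, in size and in percentage.
assert round(CLASS_I_MB + CLASS_II_MB, 2) == TE_TOTAL_MB
assert round(CLASS_I_PCT + CLASS_II_PCT, 2) == TE_TOTAL_PCT


def implied_genome_size_mb(size_mb: float, percent_of_genome: float) -> float:
    """Back-calculate genome size from a feature size and its genome percentage."""
    return size_mb / (percent_of_genome / 100)


# 1.09 Gb of repeats at 56.51% implies a genome of roughly 1.93 Gb.
print(round(implied_genome_size_mb(1090.0, 56.51), 2))
```

Because the inputs are rounded to two decimals, the implied genome size is approximate and will differ slightly depending on which row is used.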
Gene structure annotation
In the structural annotation of the R. nuchalis genome, we initially applied RepeatMasker v1.33144 to soft-mask the annotated repetitive sequences. Subsequently, gene structure prediction was conducted through three methods: homology prediction, transcriptome prediction, and de novo prediction, with integration of the results to derive the final gene structure annotation. For homology prediction, comparisons were made with the genomes of five closely related species (Ahaetulla prasina7, Calamaria septentrionalis1, Pantherophis guttatus1, Thamnophis elegans NCBI accession GCA_009769535.1, and Thermophis baileyi8) using GeMoMa v1.6.145 software. Transcriptome prediction involved mapping quality-controlled RNA-seq reads to the R. nuchalis genome using STAR v2.7.3a46, followed by transcript assembly with Stringtie v1.3.4d47 and prediction of open reading frames (ORFs) using PASA v2.3.348. De novo prediction entailed reassembly of RNA-seq reads using Stringtie v1.3.4d and analysis with PASA v2.3.348 to generate a training set, followed by de novo gene prediction using Augustus v3.3.149. Finally, the predictions were integrated using EVM v1.1.148 (EVidenceModeler).
The results indicated that homology prediction, transcriptome prediction, and de novo prediction annotated 48,439, 18,203, and 20,575 genes, respectively, with a final count of 22,057 protein-coding genes successfully annotated after EVM v1.1.148 integration. Among them, the average gene length and CDS length were 34,853.45 bp and 1,617.01 bp, respectively. Each gene contained an average of 9.12 exons, while the average lengths of exons and introns were 177.32 bp and 4,093.52 bp, respectively (Table 8).
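The reported averages can be cross-checked against one another, reading the 9.12 figure as the mean number of exons per gene. Under the simplifying assumptions that exons are essentially all coding and that a gene with n exons has n − 1 introns, the expected CDS and gene lengths follow from the exon and intron averages. The function names below are ours, and because products of averages only approximate averages of products, agreement is expected to be close rather than exact.

```python
def expected_cds_length(mean_exons: float, mean_exon_len: float) -> float:
    """Approximate mean CDS length as exons per gene times mean exon length."""
    return mean_exons * mean_exon_len


def expected_gene_length(
    mean_exons: float, mean_exon_len: float, mean_intron_len: float
) -> float:
    """Approximate mean gene length as exonic plus intronic sequence."""
    return mean_exons * mean_exon_len + (mean_exons - 1) * mean_intron_len


def relative_difference(observed: float, expected: float) -> float:
    """Return |observed - expected| as a fraction of the observed value."""
    return abs(observed - expected) / observed


MEAN_EXONS = 9.12
MEAN_EXON_LEN = 177.32
MEAN_INTRON_LEN = 4093.52
REPORTED_CDS_LEN = 1617.01
REPORTED_GENE_LEN = 34853.45

cds = expected_cds_length(MEAN_EXONS, MEAN_EXON_LEN)
gene = expected_gene_length(MEAN_EXONS, MEAN_EXON_LEN, MEAN_INTRON_LEN)
print(round(cds, 2), round(relative_difference(REPORTED_CDS_LEN, cds), 6))
print(round(gene, 2), round(relative_difference(REPORTED_GENE_LEN, gene), 6))
```

Both back-calculated values fall within 0.1% of the reported averages, which supports reading 9.12 as exons per gene.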
Gene function annotation
We have successfully completed the functional gene annotation of the R. nuchalis genome by utilizing five key public databases: GO (Gene Ontology)50, SwissProt51, NR (Non-Redundant protein Database), KEGG (Kyoto Encyclopedia of Genes and Genomes)52, and KOG (Eukaryotic Orthologous Groups of proteins)53. In the case of the GO database, we employed the default parameters of the InterProScan v5.3254 program for gene function annotation. For the remaining four databases, we utilized Blastp v2.7.1 to annotate gene functions. The results revealed that 13,451, 18,567, 19,655, 14,474, and 13,362 genes were annotated in GO50, SwissProt51, NR, KEGG52, and KOG53, respectively, accounting for 60.98%, 84.18%, 89.11%, 65.62%, and 60.58% of the total number of genes in R. nuchalis (Table 9). Notably, 9,343 genes were annotated across all five databases (Fig. 2). By integrating the annotation outcomes from these databases, we completed the functional annotation of 19,918 genes, representing 90.30% of the total gene count (Table 9, Fig. 2).
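The per-database annotation rates are the annotated gene counts divided by the 22,057 predicted genes. The sketch below recomputes them from the counts in the text; the dictionary and function names are illustrative.

```python
from typing import Dict

TOTAL_GENES = 22057

# Annotated gene counts per database, as reported in the text.
ANNOTATED: Dict[str, int] = {
    "GO": 13451,
    "SwissProt": 18567,
    "NR": 19655,
    "KEGG": 14474,
    "KOG": 13362,
}
ANNOTATED_IN_ANY = 19918  # genes annotated in at least one database


def annotation_rates(counts: Dict[str, int], total: int) -> Dict[str, float]:
    """Return the percentage of genes annotated in each database."""
    return {name: round(100 * count / total, 2) for name, count in counts.items()}


rates = annotation_rates(ANNOTATED, TOTAL_GENES)
print(rates)
# {'GO': 60.98, 'SwissProt': 84.18, 'NR': 89.11, 'KEGG': 65.62, 'KOG': 60.58}
print(round(100 * ANNOTATED_IN_ANY / TOTAL_GENES, 2))  # 90.3
```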
Subsequently, we conducted an evaluation of the genome annotation results. Initially, the annotated genes were assessed using BUSCO v4.0.526 based on the OrthoDB database vertebrata_odb10. The evaluation revealed that 3,237 complete genes were identified within 3,354 BUSCO groups, accounting for 96.51% of the database, underscoring the high completeness of the annotated genome of R. nuchalis (Table 4). Furthermore, we compared the genome of R. nuchalis with the published genomes of five closely related species, which exhibited a total gene count ranging from 18,213 to 22,959 genes (Table 10). Remarkably, R. nuchalis possessed 22,057 genes, aligning well with the published species (Table 10). Additionally, in terms of gene length, average CDS length, exon length, the average number of exons per gene, intron length, and the distribution of intron number, R. nuchalis exhibited consistency with the five closely related species (Table 10, Fig. 3).
Non-coding RNA (ncRNA) annotation
The annotation of ncRNAs in the R. nuchalis genome was accomplished through a combination of database searching and model prediction methods. Specifically, tRNAs were annotated using tRNAscan-SE v2.055, while microRNAs, rRNAs, small nucleolar RNAs, and small nuclear RNAs were identified by searching the Rfam database56 using Infernal v1.1.2 cmscan57. Additionally, RNAmmer v1.258 prediction was employed for the annotation of rRNAs and their subunits. The results showed that a total of 3,599 ncRNAs were annotated in the R. nuchalis genome, including 397 rRNAs, 981 snRNAs, and 2,063 tRNAs (Table 11).
Data Records
All the raw sequencing data generated in this study have been uploaded to the NCBI Sequence Read Archive (SRA) database with the accession number SRP50004559. The assembled chromosome-level genome data have been deposited in Genbank with the accession number GCA_039707465.160. The genome annotation data have been uploaded to Figshare (https://doi.org/10.6084/m9.figshare.25559178.v1)61.
Technical Validation
To assess the accuracy and completeness of the assembled genome of R. nuchalis, we conducted BUSCO v4.0.526 assessment, identifying 3,270 complete BUSCO genes out of 3,354, indicating 97.50% completeness (Table 4). Furthermore, mapping clean short reads and HiFi long reads to the genome showed that 98.24% and 99.97% of the genome, respectively, was covered at a depth of 1×, demonstrating high accuracy (Table 5). Additionally, for genome structure annotation, BUSCO assessment yielded 3,237 complete genes out of 3,354 BUSCO groups, representing 96.51% completeness (Table 4). Comparison with five closely related species showed consistency in gene count and various gene parameters, affirming the effectiveness of genome annotation (Table 10, Fig. 3).
Code availability
No specific code was used in this study. All analytical processes were executed according to the manuals and protocols of the corresponding bioinformatic tools. The software parameters used in this study are as follows:
fastp v0.21.0: -n 0 -f 5 -F 5 -t 5 -T 5 -q 20
KMC v3.2.1: -k17 -ci1 -cs1000000
GenomeScope v1: default
hifiasm v0.16.0: default
BUSCO v4.0.5: -l vertebrata_odb10 -g genome
BWA v0.7.15: default
minimap2: -x map-hifi
bowtie2 v2.3.2: --end-to-end --very-sensitive -L 30
HiC-Pro v2.8.1: -c confg-hicpro.txt -i -o
LACHESIS v1: CLUSTER_MIN_RE_SITES = 100, CLUSTER_MAX_LINK_DENSITY = 2.5, CLUSTER_NONINFORMATIVE_RATIO = 1.4, ORDER_MIN_N_RES_IN_TRUNK = 60, ORDER_MIN_N_RES_IN_SHREDS = 60
Juicebox v1.11.08: default
TBtools II: default
GMATA v2.2: default
TRF v4.07b: 2 7 7 80 10 50 500 -f -d -h -r
MITE-hunter: -n 20 -P 0.2 -c 3
RepeatModeler v1.0.11: -engine wublast
LTR_FINDER: default
LTR_harvest: default
LTR_retriever: default
TEclass v2.1.3: default
RepeatMasker v1.331: -nolow -no_is -gff -norna -engine abblast -lib lib
GeMoMa v1.6.1: default
STAR v2.7.3a: --outWigType bedGraph --outSAMtype BAM SortedByCoordinate --outSAMstrandField intronMotif
Stringtie v1.3.4d: default
PASA v2.3.3: -c alignAssembly.config -C -R -g genome.fasta -T -u trans.fasta -t trans.clean.fasta -f fl.acc --CPU 10 --ALIGNERS gmap
Augustus v3.3.1: --gff3=on --hintsfile=hints.gff --extrinsicCfgFile=extrinsic.cfg --allow_hinted_splicesites=gcag,atac --min_intron_len=30 --softmasking=1
EVM v1.1.1: --segmentSize 1000000 --overlapSize 100000
InterProScan v5.32: default
Blastp v2.7.1: -e 1e-5
tRNAscan-SE v2.0: --thread 4 -E -I
Infernal v1.1.2: default
RNAmmer v1.2: -S euk -m lsu,ssu,tsu -gff
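To make the listed parameters easier to reuse, the sketch below assembles the fastp quality-control command as an argument list. The trimming and quality options are those listed above for fastp v0.21.0; the paired-end input and output options (`-i`, `-I`, `-o`, `-O`) and all file names are assumptions added for illustration, since the original file paths are not given. The code only builds and prints the command and does not run fastp.

```python
import shlex
from typing import List


def fastp_command(in1: str, in2: str, out1: str, out2: str) -> List[str]:
    """Build a paired-end fastp command using the parameters listed in the text."""
    return [
        "fastp",
        "-i", in1, "-I", in2,    # input read 1 / read 2 (file names are hypothetical)
        "-o", out1, "-O", out2,  # output read 1 / read 2 (file names are hypothetical)
        "-n", "0",               # allow no N bases per read
        "-f", "5", "-F", "5",    # trim 5 bases from the front of read 1 / read 2
        "-t", "5", "-T", "5",    # trim 5 bases from the tail of read 1 / read 2
        "-q", "20",              # phred quality threshold for a qualified base
    ]


command = fastp_command("raw_R1.fq.gz", "raw_R2.fq.gz", "clean_R1.fq.gz", "clean_R2.fq.gz")
print(shlex.join(command))
```

Keeping the command as a list rather than a single string avoids shell-quoting problems if it is later passed to `subprocess.run`.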
References
Peng, C. et al. Large-scale snake genome analyses provide insights into vertebrate development. Cell 186, 2959 (2023).
Zug, G. R., Vitt, L. J. & Caldwell, J. P. Herpetology: An introductory biology of amphibians and reptiles. Syst. Biol. 42, 592 (1993).
Pyron, R. A., Burbrink, F. T. & Wiens, J. J. A phylogeny and revised classification of Squamata, including 4161 species of lizards and snakes. BMC Evol. Biol. 13, 93 (2013).
Zhao, E. M. Snakes of China. (Anhui Science and Technology Publishing House, Hefei, Anhui., 2006).
Sanders, K. L., Lee, M. S. Y., Leys, R., Foster, R. & Keogh, J. S. Molecular phylogeny and divergence dates for Australasian elapids and sea snakes (Hydrophiinae): Evidence from seven genes for rapid evolutionary radiations. Journal of evolutionary biology 21, 682–695 (2008).
Beatriz, S. M. T. G. Intrauterine and post‐ovipositional embryonic development of Amerotyphlops brongersmianus (Vanzolini, 1976) (Serpentes: Typhlopidae) from northeastern Argentina. J. Morphol. 281, 523–535 (2020).
Tang, C. Y. et al. Genetic mapping and molecular mechanism behind color variation in the Asian vine snake. Genome Biol. 24, 46 (2023).
Yan, C. et al. Temperature acclimation in hot-spring snakes and the convergence of cold response. Innovation-Amsterdam 3, 100295 (2022).
Margres, M. J. et al. The Tiger Rattlesnake genome reveals a complex genotype underlying a simple venom phenotype. Proceedings of the National Academy of Sciences 118, e2014634118 (2021).
Li, A. et al. Two Reference-Quality Sea Snake Genomes Reveal Their Divergent Evolution of Adaptive Traits and Venom Systems. Mol. Biol. Evol. 38, 4867 (2021).
Malnate, E. V. Systematic division and evolution of the colubrid snake genus Natrix, with comments on the subfamily Natricinae. P. Acad. Nat. Sci. Phila. 112, 41 (1960).
Takeuchi, H. et al. Evolution of nuchal glands, unusual defensive organs of Asian natricine snakes (Serpentes: Colubridae), inferred from a molecular phylogeny. Ecol. Evol. 8, 10219 (2018).
Mori, A. et al. Nuchal glands: a novel defensive system in snakes. Chemoecology 22, 187 (2012).
Yoshida, T. et al. Dramatic dietary shift maintains sequestered toxins in chemically defended snakes. Proceedings of the National Academy of Sciences 117, 5964 (2020).
Boulenger, G. A. Descriptions of new oriental reptiles and batrachians. Annals and Magazine of Natural History 7, 279 (1891).
Parker, H. W. XXVIII. Variation of the lepidosis of a snake from S.E. Asia. (1925).
Liu, Q., Lyu, B., Xie, X., Zeng, Y. & Guo, P. Genomic evidence sheds new light on phylogeny of Rhabdophis nuchalis (sensu lato) complex (Serpentes: Natricidae). Mol. Phylogenet. Evol. 189, 107893 (2023).
Mori, A. et al. Morphology of the nucho-dorsal glands and related defensive displays in three species of Asian natricine snakes. Journal of zoology 300, 18 (2016).
Zhu, G. et al. Cryptic diversity and phylogeography of the Rhabdophis nuchalis group (Squamata: Colubridae). Mol. Phylogenet. Evol. 166, 107325 (2022).
Belton, J. M. et al. Hi-C: A comprehensive technique to capture the conformation of genomes. Methods 58, 268–276 (2012).
Chen, S., Zhou, Y., Chen, Y. & Gu, J. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics 34, i884 (2018).
Rao, S. S. P. et al. A 3D Map of the Human Genome at Kilobase Resolution Reveals Principles of Chromatin Looping. Cell 159, 1665–1680 (2014).
Deorowicz, S. et al. KMC 2: fast and resource-frugal k-mer counting. Bioinformatics 31, 1569–1576 (2015).
Vurture, G. W. et al. GenomeScope: fast reference-free genome profiling from short reads. Bioinformatics 33, 2202–2204 (2017).
Cheng, H., Concepcion, G. T., Feng, X., Zhang, H. & Li, H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat. Methods. 18, 1 (2021).
Simao, F. A., Waterhouse, R. M., Ioannidis, P., Kriventseva, E. V. & Zdobnov, E. M. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics 31, 3210 (2015).
Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics 25, 1754–1760 (2009).
Li, H. Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences. Bioinformatics 32, 2103 (2016).
Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 357 (2012).
Belaghzal, H., Dekker, J. & Gibcus, J. H. Hi-C 2.0: An optimized Hi-C procedure for high-resolution genome-wide mapping of chromosome conformation. Methods 123, 56–65 (2017).
Burton, J. N. et al. Chromosome-scale scaffolding of de novo genome assemblies based on chromatin interactions. Nat. Biotechnol. 31, 1119 (2013).
Durand, N. C. et al. Juicebox Provides a Visualization System for Hi-C Contact Maps with Unlimited Zoom. Cell Syst. 3, 99 (2016).
Chen, C., Wu, Y. & Xia, R. A painless way to customize Circos plot: From data preparation to visualization using TBtools. iMeta 1, 35 (2022).
Chen, C. et al. TBtools-II: A “one for all, all for one” bioinformatics platform for biological big-data mining. Mol. Plant 16, 1733 (2023).
Waters, P. D. et al. Microchromosomes are building blocks of bird, reptile, and mammal chromosomes. Proceedings of the National Academy of Sciences 118, e2112494118 (2021).
Wang, X. & Wang, L. GMATA: An Integrated Software Package for Genome-Scale SSR Mining, Marker Development and Viewing. Front. Plant Sci. 7, 1350 (2016).
Benson, G. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res. 27, 573–580 (1999).
Han, Y. & Wessler, S. R. MITE-Hunter: a program for discovering miniature inverted-repeat transposable elements from genomic sequences. Nucleic Acids Res. 38, e199 (2010).
Flynn, J. M. et al. RepeatModeler2 for automated genomic discovery of transposable element families. P. Natl. Acad. Sci. USA. 117, 9451 (2020).
Xu, Z. & Wang, H. LTR_FINDER: an efficient tool for the prediction of full-length LTR retrotransposons. Nucleic Acids Res. 35, W265 (2007).
Ellinghaus, D., Kurtz, S. & Willhoeft, U. LTRharvest, an efficient and flexible software for de novo detection of LTR retrotransposons. BMC Bioinformatics 9, 18 (2008).
Ou, S. & Jiang, N. LTR_retriever: A Highly Accurate and Sensitive Program for Identification of Long Terminal Repeat Retrotransposons. Plant Physiol. 176, 1310 (2017).
Abrusán, G., Grundmann, N., DeMester, L. & Makalowski, W. TEclass–a tool for automated classification of unknown eukaryotic transposable elements. Bioinformatics 25, 1329–1330 (2009).
Bedell, J. A., Korf, I. & Gish, W. MaskerAid: a performance enhancement to RepeatMasker. Bioinformatics 16, 1040–1041 (2000).
Keilwagen, J. et al. Using intron position conservation for homology-based gene prediction. Nucleic Acids Res. 44, e89 (2016).
Dobin, A. et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29, 15 (2013).
Tung, L. H., Shao, M. & Kingsford, C. Quantifying the benefit offered by transcript assembly with Scallop-LR on single-molecule long reads. Genome Biol. 20, 287 (2019).
Haas, B. J., Salzberg, S. L., Zhu, W. & Pertea, M. Automated eukaryotic gene structure annotation using EVidenceModeler and the Program to Assemble Spliced Alignments. Genome Biol. 9, R7 (2008).
Stanke, M., Diekhans, M., Baertsch, R. & Haussler, D. Using native and syntenically mapped cDNA alignments to improve de novo gene finding. Bioinformatics 24, 637–644 (2008).
Ashburner, M. et al. Gene Ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet. 25, 25–29 (2000).
Bairoch, A. & Apweiler, R. The SWISS-PROT protein sequence data bank and its supplement TrEMBL in 1999. Nucleic Acids Res. 27, 49–54 (1999).
Ogata, H. et al. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res. 27, 29–34 (1999).
Galperin, M. Y., Makarova, K. S., Wolf, Y. I. & Koonin, E. V. Expanded microbial genome coverage and improved protein family annotation in the COG database. Nucleic Acids Res. 43, D261 (2015).
Jones, P. et al. InterProScan 5: genome-scale protein function classification. Bioinformatics 30, 1236 (2014).
Chan, P. P., Lin, B. Y., Mak, A. J. & Lowe, T. M. tRNAscan-SE 2.0: improved detection and functional classification of transfer RNA genes. Nucleic Acids Res. 49, 9077–9096 (2021).
Griffiths-Jones, S. et al. Rfam: annotating non-coding RNAs in complete genomes. Nucleic Acids Res. 33, D121–D124 (2005).
Nawrocki, E. P. & Eddy, S. R. Infernal 1.1: 100-fold faster RNA homology searches. Bioinformatics 29, 2933–2935 (2013).
Lagesen, K. et al. RNAmmer: consistent and rapid annotation of ribosomal RNA genes. Nucleic Acids Res. 35, 3100–3108 (2007).
NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRP500045 (2024).
NCBI GenBank https://identifiers.org/ncbi/insdc.gca:GCA_039707465.1 (2024).
MW, D. Genome annotation of the Rhabdophis nuchalis. figshare. Dataset. https://doi.org/10.6084/m9.figshare.25559178.v1 (2024).
Acknowledgements
This study was supported by a grant of the National Natural Science Foundation of China (NSFC 32270477). This research was also supported in part by a grant to Zhu GX (2016 M592688) from the China Postdoctoral Foundation.
Author information
Contributions
Mingwen Duan: Conceived and designed the experiments; data curation and analysis; writing (drafted the manuscript, review and editing). Shijun Yang: Conceptualization; data curation; investigation; writing (original draft, review and editing). Xiufeng Li, Xuemei Tang, Yuqi Cheng, Jingxue Luo, Ji Wang, Huina Song, and Qin Wang: Sample collection, investigation; methodology; writing (review and editing). Guangxiang Zhu: Conceived and designed the experiments; data curation; funding acquisition; resources; supervision; writing (drafted the manuscript, review and editing).
Ethics declarations
Competing interests
The authors declare no competing interests.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Duan, M., Yang, S., Li, X. et al. Chromosome-level genome assembly and annotation of the Rhabdophis nuchalis (Hubei keelback). Sci Data 11, 850 (2024). https://doi.org/10.1038/s41597-024-03708-z
DOI: https://doi.org/10.1038/s41597-024-03708-z