Background & Summary

Topmouth culter (Culter alburnus) is a fierce carnivorous species of Cyprinidae1 that lives in the middle and upper layers of the water column. It is distributed in almost every river and lake in China2, and also occurs in the Russian Far East, Mongolia, the Korean Peninsula, and the Southeastern Peninsula3. C. alburnus has high economic value and a special ecological role4,5, which has attracted extensive research interest from ichthyologists. In recent decades, C. alburnus resources have declined significantly, and populations have shifted towards younger and smaller individuals6. The declines have been most severe in natural lakes and rivers, and are related to habitat destruction caused by anthropogenic activities such as overfishing and dam construction7,8. To inform strategies for the conservation of germplasm resources, researchers have analyzed the genetic diversity of wild C. alburnus populations in large basins using mitochondrial DNA and microsatellite markers, evaluating the status of C. alburnus germplasm resources in various watersheds3,9,10. Furthermore, C. alburnus has very sensitive sight and smell. It is easily startled and has a strong stress response, which makes it difficult to transport1. These characteristics pose a challenge to farming and breeding, and the stress intolerance and sensitivity of C. alburnus have become urgent problems to be solved. Fish breeders have therefore begun to focus on cross-breeding and selective breeding programs for C. alburnus, and a few researchers have attempted to genetically improve the species in order to enhance its economic value and growth adaptability. Studies on the hybrid breeding of C. alburnus with Megalobrama fishes showed that the hybrid and backcross varieties grow faster and have stronger disease resistance than their parents11,12,13,14,15.
A transcriptome sequencing-based study identified several genes and pathways associated with the reproduction of C. alburnus16. These studies aim to improve the resource status of this species and to resolve the difficulties in its culture. However, the lack of a complete C. alburnus genome in public databases has limited efforts to dissect the genetic basis of important traits and the molecular mechanisms of response to environmental change. With the development of high-throughput sequencing technology, species evolution analysis, trait-related gene mining, and genomic breeding based on fish genome resources have gradually become research hotspots in aquaculture17. For marine fishes, the genome resources of Gadus morhua revealed its specific immune mechanism18; the detailed genetic mapping of Ictalurus punctatus revealed the molecular foundation of scale formation19; and a growth-trait and breeding study based on the Lateolabrax maculatus genome identified rrBLUP as the best genomic selection (GS) model for growth20. For freshwater fishes, the development of Cyprinus carpio genome resources has promoted research on the genetic mechanisms, origin and evolution of Cyprinidae fish21, and the Ctenopharyngodon idella genome provided a basis for analyzing traits related to growth and feeding22. High-quality genome resources for C. alburnus are therefore needed to provide more comprehensive biological information and meet research demands.

In this research, we used the Illumina NovaSeq 6000 and PacBio Sequel II platforms for genome sequencing, which generated 102.13 Gb of clean Hi-C short reads and 60.76 Gb of HiFi long reads, respectively. Contigs were assembled from the HiFi data, and the Hi-C data were then used to assist the assembly and obtain a chromosome-level result. The final assembled genome size was approximately 1.052 Gb, with a contig N50 of 32.92 Mb and a scaffold N50 of 43.09 Mb. A total of 28,228 protein-coding genes were predicted, and 27,000 of them could be annotated by at least one database (Pfam, NR, Swissprot, KOG, GO, COG and KEGG). Additionally, we annotated 598.23 Mb of repetitive sequences, covering 56.82% of the assembled genome. Non-coding RNA prediction identified 2,061 rRNAs, 1,771 snRNAs, 2,188 miRNAs, 15,612 tRNAs, 1,226 cis-regulatory elements (Cis-regs) and only 9 ribozymes. These genomic resources will help to clarify the adaptability and survivability of C. alburnus at the molecular level, as well as the genetic basis of its response to environmental change and stress. They will facilitate the resource conservation and germplasm improvement of C. alburnus, and provide a scientific foundation for selective breeding and culture management of this species.

So far, three genome assemblies have been published for C. alburnus (GCA_009869775.1, GWHBOSX00000000, GCA_028476615.1)23,24,25. However, the database records for these assemblies do not include detailed information about the genes on each chromosome or other related functional elements. A comparison with our assembly shows that the genome sizes are highly consistent (Table 1). Our assembly has 68 contigs, far fewer than the other versions, which suggests that it is closer to the true structure and sequence of the genome. The contig N50 (32,920,630 bp) and scaffold N50 (43,094,527 bp) are much longer than those of the other versions, indicating that our assembly is more contiguous and contains a larger share of long contigs (Fig. 1). This reflects better assembly quality and provides more relevant information for gene structure analysis, functional annotation and evolutionary research (Fig. 2). Notably, our genome version has only 31 gaps, far fewer than the other versions, which indicates that the genome sequence is relatively complete. Overall, we obtained a high-quality and much more complete genome, which offers higher reliability and usability for further gene function analysis and bioinformatics studies.
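For readers comparing assembly versions, the N50 statistic used above can be computed from a list of sequence lengths. The following is a minimal sketch; the function name and the toy lengths are illustrative assumptions and are not taken from any of the C. alburnus assemblies.

```python
def n50(lengths):
    """Return the N50: the length L such that sequences of length >= L
    together contain at least half of all assembled bases."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence downwards until half the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


# Toy example with hypothetical contig lengths (not real data).
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs))
```

Applied to contig lengths it gives the contig N50, and applied to scaffold lengths it gives the scaffold N50.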

Table 1 Comparison of C. alburnus published genome assembly results.
Fig. 1

Contig length distribution of different versions of C. alburnus genome. The x-axis represents the different versions of the genome, and the y-axis is the length distribution (with a log10 scale gradient).

Fig. 2

BUSCO assessment results for different versions of the genome assembly.

Methods

Sample collection and sequencing

The individual samples of C. alburnus used in this study were obtained from Lake Kuncheng, Changshu City, Jiangsu Province, China, on April 18th, 2023. Genomic DNA from the white muscle tissue was extracted using the SDS method to construct the DNA sequencing library. In addition, a portion of the samples were used for RNA extraction using the TRIzol26 reagent to obtain full-length transcriptome sequencing data and then assist in genome annotation. All samples were freshly frozen in liquid nitrogen and then stored at −80 °C until use.

The Illumina NovaSeq 6000 (Illumina, USA) and PacBio Sequel II (PacBio, USA) platforms were used for genomic sequencing to generate short and long genomic reads, respectively. The Hi-C library was constructed from fresh tissue samples. Cells were fixed with formaldehyde to cross-link intracellular proteins to DNA and DNA to DNA. The DNA was digested with the restriction endonuclease DpnII to create sticky ends on both sides of the cross-links, and end repair was then performed to introduce biotin-labeled bases. Next, DNA fragments containing interactions were circularized by ligation. The ligated DNA was then sheared into 300–700 bp fragments, and the fragments containing interactions were captured with streptavidin magnetic beads for library construction. After library construction, the concentration and insert size of the library were measured using Qubit 3.0 and Agilent 2100, respectively. The effective concentration of the library was accurately quantified by Q-PCR to ensure library quality. Finally, high-throughput sequencing was performed on the Illumina NovaSeq 6000 platform to generate 150 bp paired-end reads, and the raw data were filtered into clean reads using Fastp (V0.20.0, https://github.com/OpenGene/fastp)27. For PacBio sequencing, DNA was fragmented after the samples passed quality inspection. The fragmented DNA was then subjected to damage repair, end repair, adapter ligation, enzymatic processing, and PB purification. The PacBio Binding kit (PacBio, USA) was used to bind the primer (PacBio, USA) and polymerase (PacBio, USA) to the library before sequencing. The final reaction product was purified with AMPure PB beads (PacBio, USA). Sequencing used the CCS (HiFi) mode of the PacBio Sequel II platform and generated approximately 60.76 Gb of HiFi reads in total (Fig. 3a, Table 2).
With the version of SMRT Link used (v11.0.1.162970)28, the PacBio Sequel II instrument produced HiFi data as its default output. The mean and N50 lengths of the subreads were 12,578 bp and 12,577 bp, respectively (Table 2). The HiFi sequencing depth was 58×. From the Hi-C sequencing, 102.13 Gb of clean data were obtained, comprising 341.73 million clean paired-end reads, with a GC content of 38.78% and Q20 and Q30 rates of 97.90% and 93.79%, respectively (Table 2). The Hi-C sequencing depth was 97×. The RNA-seq library (cDNA-PCR library) was sequenced using the PromethION 48 device (Oxford Nanopore Technologies, UK). Raw current signals were converted to base sequences by the Guppy software in the MinKNOW (v2.1) software package (https://community.nanoporetech.com/downloads). A total of about 11.68 Gb of raw RNA reads were generated and used for gene prediction (Fig. 3b, Table 2).
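The reported depths can be reproduced with a simple ratio of sequenced bases to genome size. The sketch below assumes that depth was calculated against the assembled genome size of 1.052 Gb; the text does not state which denominator was used, so this is an assumption, and the function name is illustrative.

```python
def sequencing_depth(total_bases_gb, genome_size_gb):
    """Average depth = total sequenced bases / genome size (both in Gb)."""
    return total_bases_gb / genome_size_gb


# Assumption: the assembled genome size (1.052 Gb) is the denominator.
ASSEMBLY_SIZE_GB = 1.052

hifi_depth = sequencing_depth(60.76, ASSEMBLY_SIZE_GB)   # PacBio HiFi reads
hic_depth = sequencing_depth(102.13, ASSEMBLY_SIZE_GB)   # Hi-C clean data

print(f"HiFi depth: {hifi_depth:.0f}x, Hi-C depth: {hic_depth:.0f}x")
```

Under this assumption the two ratios round to the 58× and 97× values given in the text.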

Fig. 3

Raw reads statistical results. (a) The length and quality score visualization results of HiFi raw reads. (b) The length and quality score visualization results of ONT raw reads.

Table 2 Statistics for the sequencing data of C. alburnus genome.

Genome assembly and completeness of the assembled genome

The genome size, proportion of repetitive sequences, and heterozygosity were estimated by using 21-mer cluster frequency distribution analysis. We used HiFi clean reads as the input file and utilized Jellyfish (v2.3.0)29 with the parameter “-m 21” to determine the K-mer frequency distribution. The results were then analyzed by GenomeScope (v2.0)30 with the parameter “ -21”. The K-mer distribution curve is shown in Fig. 4, and the K-mer peak coverage was 57.4×. The estimated genome size of C. alburnus was about 913 Mb, the heterozygosity was about 0.56%, and the proportion of repetitive sequences was 45.30%.
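The idea behind a K-mer based size estimate can be illustrated with a simple peak-based calculation: the total number of K-mers divided by the depth of the main peak. This is a simplified sketch and not the model fitting performed by GenomeScope; the function name, the `min_depth` error cutoff and the toy histogram are all illustrative assumptions.

```python
def estimate_genome_size(histogram, min_depth=5):
    """Peak-based genome size estimate from a K-mer histogram.

    histogram maps K-mer depth -> number of distinct K-mers seen at that depth.
    K-mers below min_depth are treated as sequencing errors and ignored.
    Returns (peak_depth, estimated_size_in_kmers).
    """
    solid = {depth: n for depth, n in histogram.items() if depth >= min_depth}
    if not solid:
        raise ValueError("no K-mers at or above min_depth")
    # Depth with the most distinct K-mers is taken as the homozygous peak.
    peak_depth = max(solid, key=solid.get)
    total_kmers = sum(depth * n for depth, n in solid.items())
    return peak_depth, total_kmers / peak_depth


# Toy histogram (hypothetical): error K-mers at depth 1-2, a main peak at 20,
# and a small repeat peak at 40.
toy_histogram = {1: 1000, 2: 200, 18: 50, 19: 120, 20: 300, 21: 120, 22: 50, 40: 30}
peak, size = estimate_genome_size(toy_histogram)
print(peak, size)
```

In the toy data, K-mers at twice the peak depth contribute twice to the estimate, which is how repeated sequence enters the size calculation.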

Fig. 4

K-mer analysis (K = 21) of C. alburnus genome.

PacBio sequencing was the primary sequencing method for genome assembly, and the Illumina Hi-C sequencing data were used mainly to assist the assembly. The HiFi data generated by the PacBio Sequel II platform were assembled using Hifiasm (v0.16.1, https://github.com/chhylp123/hifiasm)31 with default parameters, and the main contigs obtained were adjusted for subsequent analyses. Contigs were clustered using ALLHIC (v0.9.8)32 according to the closeness of the association between them. The ALLHIC pipeline comprises read mapping, pruning, partitioning, salvaging, optimization, and construction steps, which together analyze the relationships between contigs comprehensively. Then, Juicer Tools (v3.0, https://github.com/aidenlab/JuicerTools)33 was used to convert the interactions between contig pairs into the required binary file, and Juicebox (v2.15.07)34 was used to manually correct the ordered and oriented contigs to obtain the final chromosome-level assembly. The interaction signals were then shown as a heat map using ALLHIC. The assembly built from the HiFi data was used to compute completeness and QV values. BUSCO (v5.2.1, odb10)35 was used to evaluate the completeness of the genome assembly from the assembled genome sequences, based on single-copy homologous gene alignment. We also employed a K-mer-based approach to evaluate genome quality: Merqury (v1.3)36 was used to estimate assembly completeness and QV by comparing the K-mer frequency distributions of the genome assembly and the whole-genome sequencing data. The blastp program in BLAST (v2.10.1+)37 with the parameter “-evalue 1e-5” was used to identify homologous genes, MCScanX was then used to identify collinear genes, and finally Circos (v0.69, http://circos.ca/documentation/tutorials/ideograms/karyotypes/) was used to draw the circular plot.

Based on the sequencing data, a high-quality chromosome-level genome assembly of C. alburnus was constructed. The genome size was 1.052 Gb, with a contig N50 of 32.92 Mb and a scaffold N50 of 43.09 Mb (Table 3). The assembly was anchored to 24 chromosomes, which account for 99.85% of the entire genome (Table 4). In our Hi-C interaction map, a sequence within Chromosome 2 (Chr2) displayed heightened compactness and interacted strongly with some other chromosome regions. Based on this observation, we speculate that this segment may be situated near the centromere of Chromosome 2 (Fig. 5). The genome characteristics of C. alburnus are displayed in Fig. 6. A total of 2,956 collinear genes were identified using MCScanX (Table S1). The completeness assessment indicated that the genome assembly covered 98.3% of complete BUSCOs and 96.3% of single-copy BUSCOs, with only 2% duplicated BUSCOs (Table 5). Evaluation based on the K-mer method showed that the QV of the assembled genome was 61.33 (Table 3). These results indicate that the C. alburnus genome assembly is complete and of high quality.
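To put the QV in context, a Phred-scaled quality value can be converted to an estimated per-base error rate with QV = −10·log10(E). The sketch below applies that standard conversion to the reported value; the function names are illustrative and the code does not reproduce how Merqury derives the QV from K-mers.

```python
import math


def qv_to_error_rate(qv):
    """Convert a Phred-scaled QV to an estimated per-base error rate."""
    return 10 ** (-qv / 10)


def error_rate_to_qv(error_rate):
    """Convert a per-base error rate back to a Phred-scaled QV."""
    return -10 * math.log10(error_rate)


rate = qv_to_error_rate(61.33)
# Express the error rate as an average spacing between errors, in Mb.
print(f"{rate:.2e} errors per base, about one error per {1 / rate / 1e6:.2f} Mb")
```

A QV above 60 therefore corresponds to fewer than one estimated consensus error per million bases.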

Table 3 Statistics of genome assembly quality of C. alburnus.
Table 4 Statistics of chromosome assembly.
Fig. 5

Heatmap of 24 chromosome interactions with Hi-C data.

Fig. 6

Genome characteristics of C. alburnus. From outside to inside, the first circle represents the chromosome sequences, the second circle gene density, the third circle GC content, the fourth circle repeat sequence content, the fifth circle SNP density, and the sixth circle INDEL density; the lines in the middle connect genes with collinearity. The density statistics window was 1 Mb.

Table 5 The completeness of the genome assembly and annotation was assessed using BUSCO.

Repeat annotation and non-coding RNA annotation

Repetitive sequences can be classified into dispersed repeats and tandem repeats. To identify repetitive sequences, we first used RepeatModeler (v2.0.1, http://www.repeatmasker.org/RepeatModeler/)38 with default parameters for de novo prediction of repetitive sequences in the genome, and then merged the predicted results with the Repbase database (http://www.girinst.org/repbase)39. Next, RepeatMasker (v4.1.0, http://www.RepeatMasker.org)40 and its RepeatProteinMask tool were each used to predict the repetitive sequences of the genome, and the results were integrated to obtain the final repeat annotation. The prediction of non-coding RNA included three parts: Barrnap (v0.9, https://github.com/tseeann/barrnap)41 with the parameter “--kingdom euk” was used to predict ribosomal RNA (rRNA) sequences; tRNAscan-SE (v2.0.0, http://lowelab.ucsc.edu/tRNAscan-SE/)42 was used to predict transfer RNA (tRNA) sequences; and Infernal (v1.1.3, https://github.com/EddyRivasLab/infernal)44 was used to search the Rfam database (ftp://ftp.ebi.ac.uk/pub/databases/Rfam/)43 to predict other non-coding RNAs (miRNA, snRNA, etc.).

A total of 598.23 Mb of repetitive sequences were identified by de novo and homology-based prediction, covering 56.82% of the assembled genome. DNA transposons were the most abundant transposable elements (TEs), with a total length of 368.99 Mb, accounting for 35.05% of the genome (Table 6). The other types of TEs detected were long interspersed elements (LINE) (32.54 Mb; 3.09%), short interspersed elements (SINE) (2.58 Mb; 0.24%) and long terminal repeats (LTR) (87.82 Mb; 8.34%) (Table 6). Additionally, satellite sequences totalling 6.17 Mb were identified, accounting for 0.59% of the genome. Non-coding RNA prediction identified 2,061 rRNAs, 1,771 snRNAs, 2,188 miRNAs, 15,612 tRNAs, 1,226 cis-regulatory elements (Cis-regs) and only 9 ribozymes (Table 7).
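The per-class proportions follow from dividing each class length by the genome size. The sketch below back-calculates the genome size from the total repeat length and its stated percentage, which is an assumption made here for illustration (the exact denominator used for Table 6 is not given in the text); the names are likewise illustrative.

```python
# Values quoted in the text above.
TOTAL_REPEAT_MB = 598.23
TOTAL_REPEAT_PCT = 56.82

# Assumption: genome size inferred from the total repeat length and percentage.
genome_mb = TOTAL_REPEAT_MB / (TOTAL_REPEAT_PCT / 100)

repeat_classes_mb = {
    "DNA transposons": 368.99,
    "LTR": 87.82,
    "LINE": 32.54,
    "Satellite": 6.17,
    "SINE": 2.58,
}


def percent_of_genome(length_mb, genome_size_mb):
    """Share of the genome occupied by a repeat class, in percent."""
    return 100 * length_mb / genome_size_mb


for name, length_mb in repeat_classes_mb.items():
    print(f"{name}: {percent_of_genome(length_mb, genome_mb):.2f}%")
```

Because the input lengths are already rounded to two decimals, the smallest classes may differ from the table in the last digit.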

Table 6 Prediction statistics of genome repeat sequence of C. alburnus.
Table 7 Statistics of the noncoding RNA in the C. alburnus genome.

Gene prediction and annotation

The messenger RNA (mRNA) prediction included ab initio prediction, homology-based prediction and transcript-based prediction. Augustus (v3.3.3, https://github.com/Gaius-Augustus/Augustus)45 with the parameters “--gff3=on --genemodel=complete --noInFrameStop=true” and GlimmerHMM (v3.0.4)46 with default parameters were used for ab initio prediction, while Exonerate (v2.4.0, https://github.com/nathanweeks/exonerate)47 with the parameters “--showtargetgff --model protein2genome --percent 50” and GeMoMa (v1.9)48 with the parameters “CLI GeMoMaPipeline, Extractor.r=true tblastn=false GeMoMa.Score=ReAlign AnnotationFinalizer.r=NO o=true” were used for homology-based prediction. For the homology-based prediction, the protein sequences of five Cypriniformes species, Ctenopharyngodon idella (GCF_019924925.1)49, Misgurnus anguillicaudatus (GCF_027580225.1)50, Onychostoma macrolepis (GCF_012432095.1)51, Puntigrus tetrazona (GCF_018831695.1)52 and Xyrauchen texanus (GCF_025860055.1)53, were downloaded from the National Center for Biotechnology Information (NCBI). RNA-seq data (ONT) were used to assist in predicting the coding structure of genes: the reads were aligned to the genome using Minimap2 (v2.15)54 with the parameter “-ax splice” and sorted using Samtools (v1.9, https://github.com/samtools/samtools), and StringTie (v2.1.3b)55 was then used with default parameters to reconstruct transcripts. ORF and coding region prediction were performed by TransDecoder (v5.1.0, https://github.com/TransDecoder/TransDecoder)56 with default parameters. The multiple data sets were integrated using EVidenceModeler (v1.1.1, partition_EVM_inputs.v2.pl with parameters “--segmentSize 500000 --overlapSize 10000”)57, and finally, PASA (v2.5.2, https://github.com/PASApipeline/PASApipeline)58 with default parameters was used to update the integrated data, add UTR regions and discover new transcripts.
Finally, to obtain annotation information for the predicted genes, we compared the nucleic acid sequence of the longest transcript of each predicted gene with the NR59, Swiss-Prot60, GO61, COG62, KOG63 and KEGG64 databases using BLAST (v2.10.1+)37 with the parameter “-evalue 1e-5”. The amino acid sequences were compared with the Pfam65 database using HMMER (v3.2.1)66 with default parameters.
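Choosing the longest transcript as the representative of each gene can be sketched as follows. The input format (tuples of gene ID, transcript ID and length) and all identifiers are hypothetical and chosen only for illustration; they do not reflect the file formats used in this study.

```python
def longest_transcripts(transcripts):
    """Pick the longest transcript per gene.

    transcripts: iterable of (gene_id, transcript_id, length) tuples.
    Returns a dict mapping gene_id -> transcript_id of its longest transcript.
    Ties keep the transcript that appears first in the input.
    """
    best = {}
    for gene_id, transcript_id, length in transcripts:
        current = best.get(gene_id)
        if current is None or length > current[1]:
            best[gene_id] = (transcript_id, length)
    return {gene_id: tid for gene_id, (tid, _) in best.items()}


# Toy records with hypothetical identifiers.
toy_records = [
    ("geneA", "geneA.t1", 1200),
    ("geneA", "geneA.t2", 1850),
    ("geneB", "geneB.t1", 900),
]
representatives = longest_transcripts(toy_records)
print(representatives)
```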

In total, 28,228 protein-coding genes were predicted by combining three approaches: ab initio prediction, homology-based prediction, and RNA-seq-based prediction. The average transcript, CDS, exon, and intron lengths were 16,754 bp, 1,604 bp, 218 bp, and 1,799 bp, respectively (Table 8). The predicted gene models of C. alburnus showed distribution patterns of gene length, CDS length, exon length and intron length similar to those of the other five Cypriniformes species (Fig. 7). Alignment against the Pfam, NR, Swissprot, KOG, GO, COG and KEGG databases showed that 27,000 predicted genes could be annotated by at least one database (Table 9). Among them, 9,627 genes were between 300 bp and 1,000 bp in length and 16,366 genes were longer than 1,000 bp (Table 9). In addition, 13,520 of the functional proteins were supported by all five databases (GO, KEGG, KOG, NR, Pfam) (Fig. 8).
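Exon and intron length statistics of the kind summarized above can be derived from exon coordinates. The sketch below assumes 1-based, inclusive coordinates (as in GFF3) for a single transcript; the function names and the toy coordinates are illustrative assumptions.

```python
def exon_lengths(exons):
    """Lengths of exons given as (start, end) pairs, 1-based inclusive."""
    return [end - start + 1 for start, end in exons]


def intron_lengths(exons):
    """Lengths of the introns lying between consecutive exons."""
    ordered = sorted(exons)
    return [
        next_start - prev_end - 1
        for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])
    ]


def mean(values):
    """Arithmetic mean, or 0.0 for an empty list (e.g. single-exon genes)."""
    return sum(values) / len(values) if values else 0.0


# Toy transcript with three exons (hypothetical coordinates).
toy_exons = [(101, 300), (1001, 1200), (2501, 2800)]
print(mean(exon_lengths(toy_exons)), mean(intron_lengths(toy_exons)))
```

Averaging these per-transcript values over all gene models would give genome-wide means comparable to those in Table 8.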

Table 8 Summary of protein-coding genes annotation of the genome assembly.
Fig. 7

Annotation quality comparison of protein-coding genes of closely related species. win = 30 bp: (a) gene length distribution (b) CDS length distribution (c) Intron length distribution (d) Exon length distribution; The vertical axis represents the percentage of genes of a certain statistical length in the total number of genes; Different colored lines represent different species.

Table 9 Statistics of functional annotation for predicted genes.
Fig. 8

Venn diagram of functional annotation based on different databases. The Venn diagram displays the shared and unique genes among the GO, KEGG, KOG, NR, Pfam databases.

Data Records

The genomic Illumina sequencing data and PacBio sequencing data were deposited in the NCBI Sequence Read Archive (SRA) database with the accession numbers SRR26130170 (ref. 67) and SRR26129934 (ref. 68).

The transcriptomic sequencing data (ONT) were deposited in the SRA database with the accession number SRR26130655 (ref. 69).

This Whole Genome Shotgun project has been deposited at DDBJ/ENA/GenBank under the accession JAWDJR000000000 (ref. 70). The genome assembly and annotation files are available in the Figshare database71.

Technical Validation

DNA and RNA integrity

The concentration of extracted DNA was measured with a NanoDrop (Thermo Fisher Scientific, Model: NanoDrop 2000) and a Qubit (Invitrogen, Model: Qubit™ 3.0 Fluorometer), and DNA integrity was assessed by agarose gel electrophoresis (Electrophoresis apparatus: Tanon, Model: EPS600; Electrophoresis chamber: TIANGEN Biochemical Technology (Beijing), Model: HE-120). High-quality DNA samples (concentration ≥ 50 ng/μl, OD260/280 = 1.8–2.2, OD260/230 = 1.8–2.5, fragment size ≥ 23 K, ratio of NanoDrop concentration to Qubit concentration N/Q = 0.9–2.0) were used to construct the sequencing library.
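The DNA acceptance criteria listed above amount to a set of threshold checks, which can be expressed as a small filter. This is a sketch only: the class and field names are hypothetical, and reading the "23 K" fragment size threshold as 23 kb is an assumption.

```python
from dataclasses import dataclass


@dataclass
class DnaSample:
    concentration_ng_per_ul: float
    od260_280: float
    od260_230: float
    fragment_size_kb: float          # assumption: "23 K" means 23 kb
    nanodrop_to_qubit_ratio: float   # N/Q


def passes_dna_qc(sample):
    """Return True if the sample meets every threshold quoted in the text."""
    return (
        sample.concentration_ng_per_ul >= 50
        and 1.8 <= sample.od260_280 <= 2.2
        and 1.8 <= sample.od260_230 <= 2.5
        and sample.fragment_size_kb >= 23
        and 0.9 <= sample.nanodrop_to_qubit_ratio <= 2.0
    )


# Hypothetical samples for illustration.
good = DnaSample(85.0, 1.9, 2.1, 30.0, 1.2)
degraded = DnaSample(85.0, 1.9, 2.1, 10.0, 1.2)
print(passes_dna_qc(good), passes_dna_qc(degraded))
```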

A NanoDrop (Model: NanoDrop 2000) and a Qubit (Invitrogen, Model: Qubit™ 3.0 Fluorometer) were used to measure the concentration of extracted RNA, and an Agilent 2100 (Agilent Technologies), LabChip GX and agarose gel electrophoresis were used to assess RNA integrity. RNA samples meeting the high-quality requirements (total amount ≥ 3 μg, sufficient for three library constructions; concentration ≥ 40 ng/μl; volume ≥ 10 μl; OD260/280 = 1.7–2.5; OD260/230 = 0.5–2.5; normal 260 nm absorption peak; RIN ≥ 8) were used to construct the sequencing library.

Genome assembly and annotation evaluation

In our research, BUSCO (v5.2.1, odb10)35 was used to evaluate the completeness of the genome assembly and annotation based on single-copy homologous gene alignment (against the actinopterygii_odb10 database)72, using the assembled genome sequences and the annotated CDS sequences, respectively.

The completeness assessment of the genome assembly showed that, of 3,640 single-copy orthologues, 98.3% (3,579 genes) were complete BUSCOs and 96.3% (3,505 genes) were complete and single-copy. Complete and duplicated genes accounted for only 2% (74 genes). In addition, 0.6% (23 genes) were fragmented and 1.0% (38 genes) were missing from the assembled genome (Table 5).

BUSCO completeness assessment using the annotated CDS sequences showed that the genome annotation covered 99.2% (3,610 genes) of the complete BUSCOs. Complete and single-copy genes accounted for 92% (3,349 genes) and duplicated genes for 7.2% (261 genes). Fragmented and missing genes accounted for 0.4% (13 genes) and 0.5% (17 genes), respectively (Table 5).
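The BUSCO percentages follow directly from the gene counts quoted above, which can be checked with a short calculation. The function name and the dictionary keys (C, S, D, F, M, mirroring BUSCO's summary notation) are illustrative choices.

```python
def busco_summary(single, duplicated, fragmented, missing):
    """Convert BUSCO category counts into percentages (one decimal place)."""
    total = single + duplicated + fragmented + missing
    complete = single + duplicated

    def pct(count):
        return round(100 * count / total, 1)

    return {
        "total": total,
        "C": pct(complete),
        "S": pct(single),
        "D": pct(duplicated),
        "F": pct(fragmented),
        "M": pct(missing),
    }


# Counts taken from the text: genome assembly and annotated CDS sequences.
assembly = busco_summary(single=3505, duplicated=74, fragmented=23, missing=38)
annotation = busco_summary(single=3349, duplicated=261, fragmented=13, missing=17)
print(assembly)
print(annotation)
```

In both cases the four categories sum to the 3,640 orthologues searched.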

Ethics declarations

This work was approved by the Bioethical Committee of Freshwater Fisheries Research Center (FFRC) of the Chinese Academy of Fishery Sciences (CAFS). All the methods used in this study were carried out following approved guidelines.