Background & Summary

Anthocidaris crassispina, known internationally as Heliocidaris crassispina, is a sea urchin of the phylum Echinodermata, class Echinoidea, order Camarodonta, family Echinometridae and genus Anthocidaris, and is commonly known as the fine-thorned sea urchin. This subtropical marine echinoderm is distributed along the eastern coast of China and, in Japan, only along the southern coast1. Sea urchin gonads are valued for their taste, high nutritional value2 and high medicinal value3, among other qualities, and market demand is increasing steadily. Wild sea urchin resources in natural sea areas have been overexploited and are declining, which has greatly promoted the development of the sea urchin breeding industry. At present, artificial breeding is the main mode of sea urchin production4. Scholars at home and abroad have performed much research on feed nutrition and breeding techniques for the artificial breeding of sea urchins5, but artificial breeding technology for A. crassispina is not as mature as that for other sea urchins. A. crassispina is dioecious and undergoes metamorphic development: each individual metamorphoses from a planktonic larva into a juvenile sea urchin and takes 1 to 2 years to reach sexual maturity and reproduce. The artificial breeding of sea urchins is limited not only by technology but also by gonadal development. In the coastal areas of southeastern China, the most suitable breeding time for A. crassispina is from May to July, and the sea urchin industry is thus highly seasonal. The ability to control sexual maturation artificially could therefore have important economic implications. Compared with other sea urchins6,7,8, A. crassispina has been the focus of few studies, and little is known about it. In particular, the lack of complete genomic information greatly hinders molecular research on A. crassispina.

In this study, second- and third-generation sequencing techniques and Hi-C technology were used to assemble a high-quality genome of A. crassispina. Based on this genome, the genetic evolution of A. crassispina populations can be analysed to confirm its intraspecific evolution and to explore its dispersal routes and historical origin at home and abroad. The genome can also be used to preliminarily detect the gene regions that were altered during domestication and to clarify, at the gene level, the domestication and evolution of A. crassispina from a wild to a cultivated species. Moreover, the sites affected by domestication can be analysed jointly with transcriptome data to explore the gene regions related to the traits of A. crassispina and to provide important resources for species improvement. The high-quality reference genome constructed in this study will therefore aid in the exploration of key favourable genes or variations specific to the A. crassispina genome and provide new gene resources for its improvement. It can also be used for the development of molecular markers, gene mapping and mining, and molecular-assisted breeding at the whole-genome level, thus promoting the molecular genetic improvement of A. crassispina.

Methods

Sample information and collection

In this study, wild adult A. crassispina (Fig. 1a,b) were caught locally in Shenzhen city, Guangdong Province, China, on April 29, 2023, and temporarily reared for 10 days at the Shenzhen Experimental Base of the South China Sea Fisheries Research Institute, Chinese Academy of Fishery Sciences (average temperature: 26 °C, dissolved oxygen: 6.45 mg/L, pH: 8.13, salinity: 31.42‰). Two purple sea urchins were collected; one was used in the experiment, and the other was kept as a spare sample. We collected the gonads, intestines and tube feet for genome sequencing and annotation. All the samples were flash-frozen in liquid nitrogen and then stored at −80 °C until use. The collection of sea urchins was conducted in accordance with the guidelines and regulations established by the Animal Ethics and Utilization Committee of the South China Sea Fisheries Research Institute, Chinese Academy of Fishery Sciences, under approval number nhdf2023-16.

Fig. 1 Photographs of the front (a) and back (b) of A. crassispina.

DNA extraction, detection and sequencing

We used the SDS method to extract genomic DNA from gonadal tissue. The specific steps were as follows: the tissue sample was ground in liquid nitrogen and transferred to a centrifuge tube; SDS solution was added, and the mixture was incubated in a water bath; NaCl solution was added, and the mixture was centrifuged; chloroform:isoamyl alcohol (24:1) was added to the supernatant, and the sample was centrifuged again; isopropanol was added to the supernatant, and the sample was centrifuged a third time; the supernatant was discarded, 75% ethanol was added, the sample was centrifuged a fourth time, and the supernatant was discarded. After drying, TE buffer was added to dissolve the DNA, and the DNA was stored at −20 °C. The extracted DNA was quantified using NanoDrop (manufacturer: Thermo Gene Company, model: NANODROP2000) and Qubit (manufacturer: Invitrogen, model: Qubit™ 3 Fluorometer) instruments, and its integrity was tested by agarose gel electrophoresis (electrophoresis instrument manufacturer: Tianneng Tanon, model: EPS600; electrophoresis tank manufacturer: Tiangen Biochemical Technology (Beijing) Co., Ltd., model: HE-120). The PacBio library was constructed from the genomic DNA of the sample. A PacBio binding kit (PacBio, USA) was used to bind the library to the sequencing primers (PacBio, USA) and polymerase (PacBio, USA). The reaction products were then purified using AMPure PB beads (PacBio, USA) and sequenced on a Sequel II sequencer (PacBio, USA). After low-quality data were filtered out, approximately 17.96 Gb of clean data were obtained, comprising 1,049,106 reads. The total sequencing depth was approximately 25×, the read N50 was 18.38 kb, and the average read length was 17.12 kb. The reads were counted by length interval to give the read length distribution (Table 1).

Table 1 Read length distribution statistics.
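The read statistics reported above (read count, total bases, mean length, read N50 and counts per length interval) can be reproduced from a list of read lengths. The following is a minimal sketch rather than the pipeline used in this study; the function names and the bin edges are illustrative assumptions.

```python
from bisect import bisect_right


def n50(lengths):
    """Return the length L such that reads of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input


def length_histogram(lengths, edges):
    """Count reads per interval; bin i holds edges[i] <= length < edges[i + 1].

    The last bin is open-ended. `edges` must be sorted in ascending order.
    """
    counts = [0] * len(edges)
    for length in lengths:
        idx = bisect_right(edges, length) - 1
        if idx >= 0:  # reads shorter than edges[0] are ignored
            counts[idx] += 1
    return counts


def summarize(lengths, edges):
    """Collect the summary values usually reported for a read set."""
    total = sum(lengths)
    return {
        "reads": len(lengths),
        "bases": total,
        "mean": total / len(lengths),
        "n50": n50(lengths),
        "histogram": length_histogram(lengths, edges),
    }


if __name__ == "__main__":
    # Hypothetical toy read lengths (bp) and bin edges, not data from this study.
    toy = [2_000, 3_000, 4_000, 5_000, 6_000]
    print(summarize(toy, edges=[0, 4_000, 6_000]))
```

As a consistency check on the reported values, 1,049,106 reads with a mean length of 17.12 kb give approximately 17.96 Gb of sequence, which matches the stated clean data volume.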

Genome assembly

High-accuracy HiFi reads were assembled using Hifiasm (version 0.19)9. Hifiasm assembly can be divided into three main steps. The first step is haplotype-aware error correction. Although HiFi reads are very accurate, some errors remain. Hifiasm loads all HiFi reads into memory for all-versus-all overlap comparison and error correction. Based on the overlaps between reads, if a base in a read differs from the corresponding base in the other reads and is supported by at least three reads, it is considered a single nucleotide polymorphism (SNP) and retained; otherwise, it is regarded as an error and corrected. Notably, Hifiasm uses only reads from the same haplotype for error correction, which avoids overcorrection and retains the heterozygous variation that distinguishes haplotypes. In this step, Hifiasm can also phase the heterozygous SNPs. The second step is the construction of the assembly graph. After correction, most of the errors have been removed, and the heterozygous variation is retained. Based on this information, Hifiasm constructs a phased string graph with reads as vertices and overlaps as edges. In contrast to general third-generation assemblers, the string graph generated by Hifiasm retains all the bubbles, and thus all the haplotype information in the genome, for subsequent haplotype processing. The third step is the generation of the assembly sequence. If no other data are available, Hifiasm randomly selects one side of each bubble to output the main assembly results (primary contigs). The final version of the genome was obtained by removing assembled plastid sequences, identified by BLAST alignment against a plastid library, and applying Hi-C. The total length of the assembled contigs was 891.02 Mb, and the contig N50 was 808.15 kb.
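The support-count rule described in the first step can be illustrated with a toy example. This sketch is not Hifiasm's implementation: it looks at a single aligned position, ignores the same-haplotype restriction, and uses hypothetical function names. It only shows how a threshold of three supporting reads separates a candidate SNP from a likely sequencing error.

```python
from collections import Counter

MIN_SUPPORT = 3  # threshold stated in the text above


def classify_column(bases, min_support=MIN_SUPPORT):
    """Classify the bases observed at one aligned position across overlapping reads.

    Returns (consensus, retained), where `retained` is the set of non-consensus
    bases that have enough supporting reads to be kept as candidate SNPs.
    """
    counts = Counter(bases)
    consensus, _ = counts.most_common(1)[0]
    retained = {
        base
        for base, n in counts.items()
        if base != consensus and n >= min_support
    }
    return consensus, retained


def correct_read_base(base, consensus, retained):
    """Keep a base if it is the consensus or a retained allele; otherwise correct it."""
    if base == consensus or base in retained:
        return base
    return consensus


if __name__ == "__main__":
    # Hypothetical pileups: 'G' seen three times is kept, 'G' seen once is corrected.
    for pileup in ("AAAAAGGG", "AAAAAAG"):
        consensus, retained = classify_column(list(pileup))
        fixed = "".join(correct_read_base(b, consensus, retained) for b in pileup)
        print(pileup, "->", fixed, "retained:", sorted(retained))
```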

Hi-C library construction

Hi-C10,11 is a chromosome conformation capture technique combined with high-throughput sequencing that is mainly used to reconstruct the three-dimensional structure of chromosomes. By capturing and sequencing the interactions between all the DNA fragments in the chromosomes, information on the interactions between segments of the genome is obtained, which assists genome assembly. The Hi-C library was prepared by in situ Hi-C, which mainly includes cell cross-linking, endonuclease digestion, end repair, cyclization, DNA purification and capture, and sequencing12. The specific operations are described below. (1) Cell cross-linking: Formaldehyde was used to fix the sample, and intracellular proteins were cross-linked with DNA, preserving their interactions and maintaining the intracellular 3D structure. (2) Endonuclease digestion: DNA was digested by restriction endonucleases to produce sticky ends on both sides of the cross-link. The most commonly used restriction endonuclease is DpnII. (3) End repair: During end repair, biotin-labelled bases were introduced to facilitate subsequent DNA purification and capture. (4) Cyclization: After end repair, DNA fragments exhibiting interactions were cyclized so that the locations of the interacting DNA could be determined during subsequent sequencing and analysis. (5) DNA purification and capture: Cross-links in the DNA were reversed, and the DNA was purified and sheared into 300–700 bp fragments. The DNA fragments containing interactions were captured using streptavidin magnetic beads, and a library was constructed. After library construction, the concentration of the library and the insert size were measured using the Qubit 2.0 and Agilent 2100 instruments, respectively, and the effective concentration of the library was quantified accurately by qPCR to ensure library quality. After the library passed inspection, high-throughput sequencing was performed on the Illumina platform in PE150 mode, and approximately 68.61 Gb of Hi-C data were obtained.

Hi-C-assisted genome assembly

The contigs were clustered, ordered and oriented with LACHESIS13 software, and the chromosome-level genome was then obtained after manual inspection and adjustment. First, Hi-C error correction was performed. Specifically, the contigs were broken into fragments of equal length (50 kb) and reassembled using the Hi-C data. Fragments that could not be restored to their location in the original assembly were considered candidate error regions, and the position of low Hi-C coverage within these regions was taken as the error point, which allowed the preliminary genome assembly to be corrected. Second, Hi-C assembly was performed on the corrected genome using LACHESIS software. After Hi-C assembly and manual adjustment of the heatmap, a total of 886.72 Mb of sequence was assigned to 21 chromosomes, accounting for 99.52% of the total assembly length. Among the sequences located on chromosomes, the total length for which the order and orientation could be determined was 826.82 Mb, accounting for 93.24% of the total length of the chromosome-anchored sequence. The detailed distribution of each chromosomal sequence is shown in Table 2.

Table 2 Hi-C assembly data.
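The two percentages reported for the Hi-C assembly follow directly from the stated lengths. The short sketch below recomputes them; the variable names are illustrative, and the values are those given in the text.

```python
def percent(part, whole):
    """Return part/whole as a percentage rounded to two decimal places."""
    return round(100.0 * part / whole, 2)


# Values taken from the text (in Mb).
CONTIG_TOTAL_MB = 891.02      # total length of the contig assembly
ANCHORED_MB = 886.72          # sequence assigned to the 21 chromosomes
ORDERED_ORIENTED_MB = 826.82  # anchored sequence with known order and orientation

# Share of the assembly anchored to chromosomes.
anchoring_rate = percent(ANCHORED_MB, CONTIG_TOTAL_MB)
# Share of the anchored sequence that could be ordered and oriented.
ordered_rate = percent(ORDERED_ORIENTED_MB, ANCHORED_MB)

if __name__ == "__main__":
    print(f"anchored to chromosomes: {anchoring_rate}%")
    print(f"ordered and oriented:    {ordered_rate}%")
```

Note that the two percentages use different denominators: the first is relative to the whole contig assembly, and the second is relative to the chromosome-anchored sequence only.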

After Hi-C error correction, Hi-C-assisted chromosome anchoring and redundancy removal, we obtained the final version of the genome. The assembly had a contig N50 of 0.81 Mb and a scaffold N50 of 37.61 Mb. We used the circlize package to draw a circular genome map showing the gene density, repeat sequence density, GC content and collinearity (Fig. 2).

Fig. 2 Genome circle map.

Repeat sequence annotation

Repeat sequences mainly comprise tandem repeats and interspersed repeats. The latter consist mainly of transposable elements (TEs), which were the main focus of our annotation. Because the conservation of repeat sequences among species is relatively low, a species-specific repeat database must be built to predict the repeat sequences of a given species. We first used RepeatModeler214 (v2.0.1), which includes two ab initio prediction programs, RECON15 (v1.0.8) and RepeatScout16 (v1.0.6), for ab initio prediction and used RepeatClassifier to classify the prediction results with the help of the Dfam (v3.5) database. Second, we used LTR_retriever17 (2.9.0) for ab initio prediction of LTRs, based mainly on the prediction results from LTRharvest18 (v1.5.10) and LTR_FINDER19 (v1.07). The ab initio prediction results were then combined with the known database to obtain the species-specific repeat sequence database. Finally, RepeatMasker20 (v4.1.2) was used to predict the TEs in the genome based on the constructed database. In total, 306,462,727 bp of TE sequence was identified, accounting for 34.39% of the genome sequence. The specific prediction results are shown in Table 3.

Table 3 Statistics for TE sequences.
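When the proportion of the genome occupied by repeats is computed from annotation intervals, overlapping annotations need to be counted only once. The following is a minimal sketch of that calculation, assuming half-open (start, end) intervals grouped by sequence name; it is not the script used in this study, and the function names are hypothetical.

```python
def merged_length(intervals):
    """Total bases covered by half-open (start, end) intervals, counting overlaps once."""
    covered = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            # The previous merged block is finished; add its length.
            if cur_end is not None:
                covered += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent interval: extend the current block.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start
    return covered


def repeat_fraction(intervals_by_seq, genome_length):
    """Percentage of the genome covered by the given repeat intervals."""
    total = sum(merged_length(iv) for iv in intervals_by_seq.values())
    return 100.0 * total / genome_length


if __name__ == "__main__":
    # Hypothetical toy annotation, not data from this study.
    toy = {"chr1": [(0, 10), (5, 15)], "chr2": [(0, 5)]}
    print(repeat_fraction(toy, genome_length=100))
```

With the totals reported above, 306,462,727 bp of TEs in an 891.02 Mb assembly corresponds to approximately 34.39% of the genome.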

The tandem repeat sequences were mainly predicted using the MIcroSAtellite identification tool21 (MISA v2.1) and Tandem Repeat Finder22 (TRF, version 409, parameters: 2 7 7 80 10 50 500 -d -h). In total, 67,813,157 bp of tandem repeat sequence was identified, accounting for 7.61% of the total genome sequence. The specific prediction results are shown in Table 4.

Table 4 Statistical information on tandem repeat sequences.

Coding gene prediction

Generally, three approaches are used to predict genes: ab initio prediction, homology-based prediction and transcriptome-based prediction. Specifically, Augustus23 (v3.1.0) and SNAP24 (2006-07-28) were used for ab initio prediction, and GeMoMa25 (v1.7) was used for prediction based on homologous species. Transcriptome-based prediction used transcripts obtained in three ways. In the first, second-generation reads were processed with Hisat26 (v2.1.0) and Stringtie27 (v2.1.4), and genes were predicted with GeneMarkS-T28 (v5.1). In the second, transcripts were assembled with Trinity29 (v2.11), and genes were predicted with PASA30 (v2.4.1). In the third, third-generation transcriptome sequences were aligned to the genome with gmap (2020-06-30) to identify splice sites, and genes were then predicted with PASA (v2.4.1). The prediction results from the three approaches were integrated with EVM31 (v1.1.1) and refined with PASA (v2.4.1). In total, we obtained 28,966 genes and counted the number of EVM-integrated genes supported by each of the three prediction approaches (Fig. 3). Most of the genes were supported by at least two prediction approaches, indicating that the prediction quality was high.

Fig. 3 Map showing the distribution of integrated genes derived from three prediction methods.

Noncoding RNA prediction

Noncoding RNAs include microRNAs (miRNAs), rRNAs, tRNAs and other RNAs with known functions. Different strategies were used to predict the different types of noncoding RNA according to their structural characteristics: tRNAscan-SE32 (v1.3.1) was used to identify tRNAs, rRNAs were mainly predicted using barrnap33 (v0.9), and miRNAs, snoRNAs and snRNAs were predicted using the Rfam34 (v14.5) database and Infernal35 (v1.1). A total of 8,855 tRNAs, 602 rRNAs and 86 miRNAs were predicted.

Pseudogene annotation

Pseudogenes have sequences similar to those of functional genes but have lost their original function due to insertions, deletions and other mutations. Homologous gene sequences (candidate pseudogenes) were detected in the gene-masked genome by GenBlastA36 (v1.0.4) alignment, and premature stop codons and frameshift mutations in these sequences were then identified using GeneWise37 (v2.4.1).

Gene function annotation

The predicted gene sequences were annotated against the NR, eggNOG38, GO, KEGG39, TrEMBL40, KOG, SWISS-PROT40 and Pfam41 databases. A total of 94.58% of the genes could be annotated against the databases, and the functional annotation statistics are shown in Table 5.

Table 5 Statistical information from gene function annotations.

Version 5.0 of the Evolutionary Genealogy of Genes: Non-supervised Orthologous Groups (eggNOG) database contains whole-genome protein sequences from 5,090 organisms (477 eukaryotes, 4,445 representative bacteria and 168 archaea) and 2,502 viruses and is an extension of the NCBI COG/KOG database. eggNOG uses 20 functional categories introduced by COG, KOG and arCOG to classify genes at the functional level. The number of genes in each functional category can, to a certain extent, reflect the adaptation of the species to its environment and can be interpreted together with the biological traits or living environment of the species under study. The statistical results of the eggNOG annotation classification are shown in Fig. 4.

Fig. 4 Graph of the eggNOG functional annotation results.

Data Records

The raw sequence data reported in this paper have been deposited in the Genome Sequence Archive (Genomics, Proteomics & Bioinformatics 2021) of the National Genomics Data Center (Nucleic Acids Res 2022), China National Center for Bioinformation/Beijing Institute of Genomics, Chinese Academy of Sciences (GSA: CRA01410842), which are publicly accessible at https://ngdc.cncb.ac.cn/gsa43,44. The whole-genome sequence data reported in this paper have been deposited in the Genome Warehouse45 of the National Genomics Data Center44, Beijing Institute of Genomics, Chinese Academy of Sciences/China National Center for Bioinformation, under accession number GWHERBV0000000046, which is publicly accessible at https://ngdc.cncb.ac.cn/gwh. The assembled genome of A. crassispina has been deposited in NCBI under accession PRJNA112842547.

Technical Validation

Evaluation of the genome assembly

The Core Eukaryotic Genes Mapping Approach48 (CEGMA) uses genes that are conserved in eukaryotic model organisms (458 genes) to construct a core gene library, which is then combined with tBLASTn, GeneWise, geneid and other software to evaluate the completeness of an assembled genome. The CEGMA completeness of this assembly was 91.48%.

BUSCO49 (v5.2.1) uses sets of single-copy genes representing several large evolutionary branches, built from the orthologue database OrthoDB10. The gene set is compared with the assembled genome, and the assembly is evaluated according to the proportion and completeness of the aligned genes. The more complete the genome assembly, the greater the proportion of complete BUSCOs detected. For genomes that are neither highly repetitive nor polyploid, the proportion of complete and duplicated BUSCOs should not be too high. The metazoan database of OrthoDB10 was selected, and the BUSCO completeness was 96.96%.

The short reads obtained using second-generation high-throughput sequencing (such as Illumina sequencing) were aligned to the assembled genome with BWA software. The completeness of the assembly and the uniformity of sequencing coverage can be evaluated from the mapping rate, the proportion of the genome covered and the depth distribution. The mapping rate of the second-generation reads was 98.80%, the genome coverage was 99.69%, and the average sequencing depth was 35×.

The HiFi reads were aligned to the assembled genome using Minimap2 software for the same purpose. The mapping rate of the third-generation reads was 99.03%, the genome coverage was 99.98%, and the average sequencing depth was 17×.

HiFi reads were aligned to the assembly to obtain the coverage depth of each site in the genome. Then, a 10 kb window was slid along the genome without overlap (if the remaining sequence was shorter than 10 kb, its actual length was used), and the average sequencing depth (the sum of the sequencing depths of all sites in the window divided by the size of the window) and the GC content of each window were calculated. Finally, a density map of the contig GC content distribution vs. sequencing depth distribution was drawn from these statistics (Fig. 5).

Fig. 5 GC content and depth distribution map.
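The windowed calculation behind the GC-depth plot can be sketched as follows. This is an illustration under simple assumptions (one sequence held as a string, with one depth value per base), not the script used in this study; the function name is hypothetical.

```python
def window_stats(sequence, depths, window=10_000):
    """Compute GC content and mean depth in consecutive non-overlapping windows.

    `sequence` is a DNA string and `depths` holds one depth value per base.
    The last window may be shorter than `window`; its actual length is used.
    Returns a list of (start, size, gc_percent, mean_depth) tuples.
    """
    if len(sequence) != len(depths):
        raise ValueError("sequence and depths must have the same length")
    stats = []
    for start in range(0, len(sequence), window):
        seq = sequence[start:start + window].upper()
        dep = depths[start:start + window]
        size = len(seq)
        gc = sum(1 for base in seq if base in "GC")
        stats.append((start, size, 100.0 * gc / size, sum(dep) / size))
    return stats


if __name__ == "__main__":
    # Hypothetical toy contig with a window of 4 bp for readability.
    toy_seq = "GGCCATATGC"
    toy_depth = [10] * 4 + [20] * 4 + [30] * 2
    for row in window_stats(toy_seq, toy_depth, window=4):
        print(row)
```

Plotting the GC percentage against the mean depth for every window gives the kind of density map shown in Fig. 5, in which contamination or collapsed regions would appear as separate clusters.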

Quality evaluation of the Hi-C library

HiC-Pro50 (v2.10.0) is software that filters and evaluates Hi-C data. By analysing the alignment results, it identifies valid and invalid interaction pairs in the Hi-C sequencing data and thereby supports an evaluation of Hi-C library quality. Reads that align to the assembled genome are called mapped reads, and the alignment efficiency is the percentage of mapped reads among the clean reads, which represents the utilization of the sequencing data. The alignment efficiency is affected not only by the quality of the data but also by the quality of the genome assembly. Using bwa51 (version: 0.7.17-r1188; alignment mode: aln; other parameters: default), the paired-end sequencing data were aligned to the assembled genome. A total of 105,396,342 read pairs aligned to the genome, of which 58,164,616 pairs (55.19%) were valid Hi-C interaction pairs.

Evaluation of the Hi-C assembly results

To evaluate the chromosome grouping, we divided the Hi-C-assembled chromosomes into bins of 300,000 bp and used the number of Hi-C read pairs linking any two bins as a measure of the interaction intensity between them to construct a heatmap (Fig. 6). In the heatmap, 21 chromosome groups can be clearly distinguished. Within each group, the interaction intensity is greater on the diagonal than off the diagonal, indicating that the interaction between adjacent sequences in the Hi-C assembly is strong and that between nonadjacent sequences is weak. This pattern is consistent with the principle of Hi-C-assisted genome assembly and supports the quality of the chromosome-level assembly.

Fig. 6 Heatmap of chromosome interactions in the Hi-C assembly.
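The binning step behind the heatmap can be sketched for a single chromosome as follows. This is a simplified illustration, not the software used in this study: it assumes 0-based mapping positions for the two ends of each valid read pair, and the function name is hypothetical.

```python
def contact_matrix(pairs, chrom_length, bin_size=300_000):
    """Build a symmetric bin-by-bin contact count matrix for one chromosome.

    `pairs` is an iterable of (pos1, pos2) 0-based positions for the two ends
    of each valid Hi-C read pair. Each pair adds one count to the cell for its
    two bins; off-diagonal counts are mirrored so the matrix stays symmetric.
    """
    n_bins = (chrom_length + bin_size - 1) // bin_size  # ceiling division
    matrix = [[0] * n_bins for _ in range(n_bins)]
    for pos1, pos2 in pairs:
        i, j = pos1 // bin_size, pos2 // bin_size
        matrix[i][j] += 1
        if i != j:
            matrix[j][i] += 1
    return matrix


if __name__ == "__main__":
    # Hypothetical toy read pairs on a 900 kb chromosome (three 300 kb bins).
    toy_pairs = [(0, 100), (10, 350_000), (400_000, 10), (650_000, 700_000)]
    for row in contact_matrix(toy_pairs, chrom_length=900_000):
        print(row)
```

In a correct assembly, counts concentrate near the diagonal of such a matrix because sequences that are close together on a chromosome interact more often than distant ones.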

To further verify the accuracy of the genome assembly, we performed a collinearity analysis between A. crassispina and Heliocidaris tuberculata53. The gene sequences of the two species were first compared using diamond52 (v0.9.29.130) to identify similar gene pairs. Then, according to the gff3 files, MCScanX54 was used to determine whether similar gene pairs were adjacent on the chromosomes, which yielded all the genes in collinear blocks. A collinearity map of the two species was drawn with JCVI55 (v0.9.13) (Fig. 7).

Fig. 7 Collinearity graph.

Predictive evaluation of coding genes

We used BUSCO49 (v5.2.2) with the metazoan database to evaluate the completeness of the gene predictions. Of the 954 BUSCOs assessed, 925 (96.96%) were complete, 2 (0.21%) were fragmented and 27 (2.83%) were missing, indicating that the completeness of the gene predictions was high.
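The BUSCO percentages follow from the three reported counts, whose sum gives the size of the gene set. A minimal sketch of the arithmetic is given below; the function name is illustrative, and the counts are those stated in the text.

```python
def busco_summary(complete, fragmented, missing):
    """Return the BUSCO set size and the percentage in each category."""
    total = complete + fragmented + missing

    def pct(n):
        return round(100.0 * n / total, 2)

    return {
        "total": total,
        "C": pct(complete),    # complete
        "F": pct(fragmented),  # fragmented
        "M": pct(missing),     # missing
    }


if __name__ == "__main__":
    # Counts reported for the gene predictions in this section.
    print(busco_summary(complete=925, fragmented=2, missing=27))
```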