Abstract
Anthocidaris crassispina is a very popular edible sea urchin distributed along the coast of the South China Sea. In this study, we performed whole-genome sequencing and generated a chromosome-level assembly of this species. The total length of the genomic contig sequence was 891.02 Mb, and contig N50 was 808.15 kb when Hifiasm was used for assembly. The Hi-C library was constructed and sequenced, yielding approximately 68.61 Gb of data. After Hi-C assembly, approximately 886.72 Mb of sequence was able to be mapped onto 21 chromosomes, accounting for 99.52% of the total genome length. Among the sequences located on the chromosomes, those for which the order and direction could be determined accounted for approximately 826.82 Mb, or 93.24% of the total length. These results provide valuable resources for further study of A. crassispina at the genetic level.
Similar content being viewed by others
Background & Summary
Anthocidaris crassispina, known internationally as Heliocidaris crassispina, belongs to the genus Anthocidaris of Echinodermata, Echinoidea, Camarodonta, and Echinometridae and is commonly known as the fine-thorned sea urchin. Subtropical marine echinoderms are distributed on the eastern coast of China but only on the southern coast of Japan1. Sea urchins have the advantages of a delicious gonad taste, high nutritional value2, and high medicinal value3, among others, and their daily market demand is increasing. The wild sea urchin resources in natural sea areas have been overexploited and are declining daily, which has greatly promoted the development of the sea urchin breeding industry. At present, artificial breeding is the main mode of breeding for sea urchins4. Scholars at home and abroad have performed much research on food nutrition and breeding techniques in the process of artificial breeding of sea urchins5, but artificial breeding technology for A. crassispina is not as mature as that for other sea urchins. A. crassispina are dioecious and exhibit abnormal development during the growing period. Each individual metamorphoses into a young sea urchin from a planktonic larva, and it takes 1 to 2 years to achieve sexual maturity and reproduction. The artificial breeding of sea urchins is not only limited by technology but is also affected by the gonadal development of sea urchins. In the coastal areas of southeastern China, the most suitable breeding time for A. crassispina is from May to July, and the sea urchin industry is thus highly seasonal. Manually controlled sexual maturity could have important economic implications. Compared with other sea urchins6,7,8, A. crassispina has been the focus of few studies, and we therefore know little about it. Due to the lack of complete genomic information, molecular research on A. crassispina is greatly hindered.
In this study, A. crassispina was used as the research material, and second- and third-generation sequencing techniques and Hi-C technology were used to assemble a high-quality A. crassispina genome. Based on this genome, we can analyse the genetic evolution of A. crassispina populations, confirm its intraspecific evolution and explore its transmission route and historical origin at home and abroad. The genome can also be used to preliminarily detect the gene regions that were altered during the historical domestication process and clarify the domestication and evolution of A. crassispina from a wild to cultivated species at the gene level. Moreover, the sites affected by domestication and the transcriptome can be jointly analysed to explore the gene regions related to the traits of A. crassispina and to provide important resources for species improvement. Therefore, the high-quality reference genome constructed in this study will aid in the exploration of key genome-specific favourable genes or variations in A. crassispina and provide new key gene resources for the improvement of A. crassispina. These findings can also be used for the development of molecular markers, gene mapping and mining, and molecular-assisted breeding at the whole-genome level and thus promote the molecular genetic improvement of A. crassispina.
Method
Sample information and collection
In this study, wild adult A. crassispina (Fig. 1a,b), which were caught locally in Shenzhen city, Guangdong Province, China, on April 29, 2023, were temporarily reared at the Shenzhen Experimental Base of the South China Sea Fisheries Research Institute, Chinese Academy of Fisheries Sciences (average temperature: 26 °C, dissolved oxygen: 6.45 mg/L, pH: 8.13, salinity: 31.42‰) for 10 days. We collected the gonads, intestines and tube feet of the individuals for genome sequencing and annotation. A total of two purple sea urchins were collected, one of which was used in the experiment, and the other was used as a spare sample. In this study, the collection of sea urchins was conducted in accordance with the guidelines and regulations established by the Animal Ethics and Utilization Committee of the South China Sea Fisheries Research Institute, Chinese Academy of Fishery Sciences, under approval number nhdf2023-16. All the samples were freshly frozen in liquid nitrogen and then stored at −80 °C until use.
DNA extraction, detection and sequencing
We used the SDS method to extract genomic DNA from gonadal tissue. The specific steps were as follows: the tissue sample was ground with liquid nitrogen and packed into a centrifuge tube, SDS solution was added, the mixture was incubated in a water bath, NaCl solution was added, the mixture was centrifuged, isoamyl alcohol (24:1) was added to the upper supernatant, the samples were centrifuged again, isopropanol was added to the upper supernatant, the samples were centrifuged a third time, the supernatant was discarded, 75% ethanol was added, the samples were centrifuged a fourth time, and the supernatant was discarded. After drying, TE was added to dissolve the DNA, and the DNA was stored in a freezer at −20 °C. The extracted DNA was detected using Nanodrop (manufacturer: Thermo Gene Company, model: NANODROP2000) and Qubit (manufacturer: Invitrogen, model: QubitTM3Flurometer) instruments, and its integrity was tested by performing agarose gel electrophoresis (electrophoresis instrument manufacturer: Tianneng Tanon, model: EPS600; electrophoresis tank manufacturer: Tiangen Biochemical Technology (Beijing) Co., Ltd., model: HE-120). A PacBio binding kit (PacBio, USA) was employed to ligate the library with primers (PacBio, USA) and polymerase (PacBio, USA). The resultant reaction products were subsequently purified using AMpure PB Beads (PacBio, USA) and sequenced on a Sequel II sequencer (PacBio, USA). The PacBio library was constructed using the genomic DNA from the samples, and approximately 17.96 Gb of clean data were obtained by sequencing. The total sequencing depth was approximately 25X, the read N50 was 18.38 kb, and the average read length was 17.12 kb. The clean data obtained by filtering out low-quality data contained a total of 1,049,106 reads, and the reads were counted according to different gradient length distributions (Table 1).
Genome assembly
High-accuracy HiFi data were assembled using Hifiasm software (version 0.19)9. Hifiasm assembly can be divided into three main steps. The first step is to identify and correct haplotype errors. Although HiFi reads are very accurate, some errors remain. Hifiasm reads all HiFi reads into memory for all vs. all comparisons and error corrections. Based on the overlapping information between reads, if a base was present in a read that was different from other bases and had at least three supporting reads, it was considered a single nucleotide polymorphism (SNP) and was retained; otherwise, it was regarded as an error and corrected. Notably, Hifiasm uses only the same haplotype data for error correction to avoid overcorrection and retain heterozygous variation information from different haplotypes. In this step, Hifiasm can phase the hybrid SNPs. The second step is the construction of the assembly diagram. After correction, most of the errors are removed, and the heterozygous variation information is retained. Based on this information, Hifiasm constructs a phase string graph with reads as vertices and overlap regions as edges. In contrast to the general third-generation data assembly, the string graph generated via Hifiasm will retain all the bubbles, thus retaining all the haplotype information in the genome for subsequent haplotype processing. The third step is the generation of the assembly sequence. If no other data are available, Hifiasm will randomly select one side of each bubble to output the main assembly results (primary contigs). The final version of the genome was obtained by removing the assembled plastid sequence and applying Hi-C with BLAST alignment to the plastid library. The total length of the assembled genome contig sequence was 891.02 Mb, and the contig N50 was 808.15 Kb.
Hi-C library construction
Hi-C10,11 is a technique for capturing the chromosomal conformation combined with high-throughput sequencing that mainly reconstructs the three-dimensional structure of chromosomes. Through the capture and sequencing of the interactions between all the DNA fragments in the chromosome, information on interactions between the segments of the genome is obtained, thus assisting in the assembly of the genome. The type of Hi-C used in the Hi-C library sequencing experiment was in situ Hi-C, which mainly includes cell cross-linking, endonuclease digestion, terminal end repair, cyclization, DNA purification and capture and computer sequencing12. The specific operations are described below. (1) Cell cross-linking: Formaldehyde was used to fix the sample, and intracellular proteins were cross-linked with DNA, preserving their interactions and maintaining the intracellular 3D structure. (2) Endonuclease digestion: DNA was digested by restriction endonucleases to produce sticky ends on both sides of the cross-link. The most commonly used restriction endonuclease is DpnII. (3) End repair: Using the terminal repair mechanism, biotin-labelled bases were introduced to facilitate subsequent DNA purification and capture. (4) Cyclization of the DNA after terminal repair: Cy DNA fragments exhibiting interactions were cyclized to ensure that the location of the interacting DNA was determined during subsequent sequencing and analysis. (5) DNA purification and capture: Cross-links in DNA were reversed, and the DNA was purified and broken into 300 bp-700 bp fragments. The DNA fragments containing interactions were captured using streptavidin magnetic beads, and a library was constructed. After the construction of the library, the concentration of the library and the size of the inserted fragments were detected using the Qubit2.0 and Agilent2100 instruments, respectively, and the effective concentration of the library was quantified accurately by qPCR to ensure the quality of the library. After passing the library inspection, high-throughput sequencing was performed on the Illumina platform, and the sequencing read length was PE150. A Hi-C library was constructed and sequenced, and approximately 68.61 Gb of data were obtained.
Hi-C-assisted genome assembly
The genome sequence was divided, sequenced and oriented with LACHESIS13 software, and then the chromosome-level genome was obtained by manual mapping and inspection. First, Hi-C error correction was performed. Specifically, the contig version of the genome was interrupted into equal lengths (50 kb) and reassembled using Hi-C. The fragments that could not be restored to their location in the original assembly sequence were considered candidate error regions, and low Hi-C coverage in these regions indicated the error point, enabling complete error correction of the preliminary genome assembly. Second, Hi-C assembly was performed. Specifically, after error correction, the genome was assembled using LACHESIS software. After Hi-C assembly and manual adjustment of the heatmap, the genomic sequences of the common 886.72 Mb sequence were assigned to 21 chromosomes, accounting for 99.52% of the total length. Among the sequences located on chromosomes, the total length for which the order and direction could be determined was 826.82 Mb, accounting for 93.24% of the total length of the located chromosomal sequence. The detailed distribution of each chromosomal sequence is shown in Table 2.
After the Hi-C error correction, auxiliary chromosome mounting and deredundancy steps, we obtained the final version of the genome for our project. The assembly results showed contig N50 of 0.81 Mb and scaffold N50 of 37.61 Mb. We used the circlize package to construct a loop map, which mainly shows the gene density, repeat sequence density, GC content and collinearity (Fig. 2).
Repeat sequence annotation
The repeat sequences included mainly tandem repeat sequences and scattered repeat sequences. The second type mainly included transposable elements (TEs), which were the main focus of our study. Because the conservation of repeat sequences among species is relatively low, a specific repeat sequence database must be built for the prediction of repeat sequences in a given species. We first used RepeatModeller214 (v2.0.1), which includes two ab initio prediction software programs, RECON15 (v1.0.8) and RepeatScout16 (v1.0.6), for ab initio prediction and used the Repeat Classifier to classify the prediction results with the help of the Dfam (v3.5) database. Second, we used LTR_retriever17 (2.9.0) to make ab initio predictions of LTRs, which were mainly based on the prediction results from LTRharvest18 (v1.5.10) and LTR_FINDER19 (v1.07). Then, the above ab initio prediction results and the known database were combined to obtain the species-specific repeat sequence database. Finally, RepeatMasker20 (v4.1.2) was used to predict the TEs in the genome based on the constructed repeat sequence database. Finally, a total TE sequence of approximately 306,462,727 bp was obtained, accounting for 34.39% of the genome sequence. The specific prediction results are shown in Table 3.
The tandem repeat sequences were mainly predicted using the MIcroSAtellite identification tool21 (MISA v2.1) and Tandem Repeat Finder22 (TRF, version 409, parameter: 2 7 7 80 10 50 500 -d -h). Finally, a tandem repeat sequence of approximately 67,813,157 bp was obtained, accounting for 7.61% of the total genome sequence. The specific prediction results are shown in Table 4 below.
Coding gene prediction
Generally, three methods are used to predict genes: homology prediction, ab initio prediction and transcriptome prediction. Specifically, Augustus23 (v3.1.0) and SNAP24 (2006-07-28) were used for ab initio prediction, GeMoMa25 (v1.7) was used for predictions based on homologous species, and the second-generation transcripts used for prediction mainly included transcripts assembled in three ways. One way was to use Hisat26 (v2.1.0), Stringtie27 (v2.1.4) and GeneMarkS-T28 (v5.1) for gene prediction. The second step involved the use of Trinity29 (v2.11) to assemble transcripts and then PASA30 (v2.4.1) for gene prediction, and the third-generation transcriptional group approach involved the use of gmap (2020-06-30) for comparisons after a series of splicing sites and then using PASA (v2.4.1) for gene prediction. Finally, the prediction results from the above three methods were integrated by EVM31 (v1.1.1) and modified by PASA (v2.4.1). Finally, we obtained 28,966 genes and counted the number of genes integrated by EVM with support from the three prediction methods (Fig. 3). Most of the genes were supported by at least two prediction methods, indicating that the prediction quality was high.
Noncoding RNA prediction
Noncoding RNAs include microRNAs, rRNAs, tRNAs and RNAs with other known functions. According to the structural characteristics of different noncoding RNAs, different strategies were used to predict them; tRNAscan-SE32 (v1.3.1) was used to identify tRNAs, rRNAs were mainly predicted using barrnap33 (v0.9), while miRNAs, snoRNAs and snRNAs were predicted using the Rfam34 (v14.5) database and Infenal35 (v1.1). A total of 8,855 tRNAs, 602 rRNAs, and 86 miRNAs were predicted.
Pseudogene annotation
Pseudogenes have sequences similar to those of functional genes but lose their original function due to insertions, deletions and other mutations. Through GenBlastA36 (v1.0.4) alignment, homologous gene sequences (possible genes) were detected in the gene-masked genome, and then immature stop codons and frameshift mutations in the gene sequence were identified using GeneWise37 (v2.4.1).
Gene function annotation
The predicted gene sequences were annotated against the NR, eggNOG38, GO, KEGG39, TrEMBL40, KOG, SWISS-PROT40 and Pfam41 databases. A total of 94.58% of the genes could be annotated against the databases, and the functional annotation statistics are shown in Table 5.
Version 5.0 of the Evolutionary Genealogy of Genes: Nonsupervised Orthologous Groups (eggNOG) database contains whole-genome protein sequences from 5090 organisms (477 eukaryotes, 4,445 representative bacteria and 168 archaea) and 2,502 viruses and is an extension of the NCBI COG/KOG database. eggNOG uses 20 functional categories introduced by COG, KOG and arCOG to classify genes at the functional level. In different functional categories, the number of genes can, to a certain extent, reflect the adaptation of the species to the environment, which can be combined with the biological traits or living environment of the research object to provide a scientific explanation. The statistical results of the eggNOG annotation classification are shown in Fig. 4.
Data Records
The raw sequence data reported in this paper have been deposited in the Genome Sequence Archive (Genomics, Proteomics & Bioinformatics 2021) of the National Genomics Data Center (Nucleic Acids Res 2022), China National Center for Bioinformation/Beijing Institute of Genomics, Chinese Academy of Sciences (GSA: CRA01410842), which are publicly accessible at https://ngdc.cncb.ac.cn/gsa43,44. The whole-genome sequence data reported in this paper have been deposited in the Genome Warehouse45 of the National Genomics Data Center44, Beijing Institute of Genomics, Chinese Academy of Sciences/China National Center for Bioinformation, under accession number GWHERBV0000000046, which is publicly accessible at https://ngdc.cncb.ac.cn/gwh. The assembled genome of A. crassispina has been deposited in NCBI under accession PRJNA112842547.
Technical Validation
Evaluation of the genome assembly
Core Eukaryotic Genes Mapping Approach48 (CEGMA) was used to evaluate conserved genes (458 genes) in eukaryotic model organisms to construct a core gene library, which was then combined with tBLASTn, GeneWise, geneid and other software to evaluate the integrity of the assembled genome. The CEGMA integrity of this biological sample was 91.48%.
BUSCOv5.2.149 was used to construct a set of single-copy genes representing several large evolutionary branches based on the linear homology database OrthoDB10. The gene set was compared with the assembled genome and evaluated according to the proportion and integrity of the alignment. The greater the integrity of the genome assembly was, the greater the proportion of complete BUSCOs detected. For nonhighly repetitive and polyploid genomes, the proportion of complete and duplicated BUSCOs detected should not be too high. The metazoan database OrthoDB10 was selected, and the integrity of the BUSCO evaluation was 96.96%. The short sequences obtained using second-generation high-throughput sequencing (such as Illumina sequencing) were compared with the assembled genome with BWA software to evaluate the integrity of the assembly and the uniformity of sequencing coverage. The integrity of the assembled genome and the uniformity of sequencing coverage can be evaluated by calculating the statistical comparison rate, the proportion of genome coverage and the depth distribution. The second-generation read return ratio was 98.80%, and the coverage was 99.69. The average sequencing depth was 35. The HiFi reads were compared with the assembled genome using Minimap2 software to evaluate the integrity of the assembly and the uniformity of sequencing coverage. The third-generation read match ratio was 99.03%, the coverage was 99.98, and the average sequencing depth was 17.
HiFi reads were compared to the assembly results to obtain the coverage depth of each site in the genome. Then, a window with a size of 10 kb was slid continuously along the genome without overlap (if the length of the sequence was less than 10 kb, the true length was selected), and the average sequencing depth (the sum of the sequencing depths of all sites in the window/the size of the window) and the percentage of GC content in the window were calculated. Finally, a density map of the contig GC content distribution vs. sequencing depth distribution was drawn according to the statistical data (Fig. 5).
Quality evaluation of the Hi-C library
HiC-Pro50 (v2.10.0) is software that filters and evaluates Hi-C data. It can identify valid interaction pairs and invalid interaction pairs in the Hi-C sequencing results by analysing the comparison results and supports an evaluation of the Hi-C library quality. The reads compared to the specified assembled genome are called mapped reads, and the alignment efficiency refers to the percentage of mapped reads among the clean reads, which represents the utilization of the sequencing data. Comparison efficiency is affected not only by the quality of the data but also by the quality of the designated genome assembly. Using bwa51 (version: 0.7.17murr1188; alignment mode: other parameters of aln; default), the two-terminal sequencing data were compared with the sequences of the assembled genome. A total of 105,396,342 pairs of reads were obtained, of which 58,164,616 pairs were valid Hi-C data, accounting for 55.19% of the data aligned to the genome.
Evaluation of the Hi-C assembly results
We determined the grouping of chromosomes by cutting the genome of Hi-C assembled into chromosomes 300000 bp and a bin, and then the number of covered Hi-C ReadPairs between any two bins was used as a signal of the intensity of interaction between the two bins to construct a heatmap (Fig. 6). According to the heatmap, 21 chromosome groups can be clearly distinguished; within each group, the intensity of interaction in the diagonal position is greater than that in the nondiagonal position, indicating that the interaction intensity between adjacent sequences in the Hi-C assembly is high, while the strength of the interaction signal between nonadjacent sequences is weak, which is consistent with the principle of Hi-C-assisted genome assembly, which proves that the effect of genome assembly is better.
We used collinear analysis, specifically diamond52 (v0.9.29.130), as a method to compare the gene sequences of A. crassispina and Heliocidaris tuberculata53 to determine similar gene pairs and to better verify the accuracy of the genome assembly. Then, according to the gff3 file, we determined whether similar gene pairs were adjacent on the chromosome. This process was performed mainly using MCScanX54, and finally, all the genes in the collinear blocks were obtained. A collinear map of the linear patterns of the two species was drawn with JCVI55 (v0.9.13) (Fig. 7).
Predictive evaluation of coding genes
The metazoan database in BUSCO49 contains 925 complete BUSCOs, accounting for 96.96%. We used BUSCO (v5.2.2) software to evaluate the integrity of the gene predictions, which included 2 fragmented BUSCOs (0.21%) and 27 missing BUSCOs (2.83%), indicating that the integrity of the gene predictions was high.
Code availability
All commands and pipelines used for data processing were executed according to the manual and protocols of the corresponding bioinformatics software.
References
Luo, H. X. et al. Analysis of morphological differences of six wild populations of Anthocidaris crassispina in the South China Sea. Guangdong agricultural science 42, 114–119 (2015).
Xu, H. et al. Analysis and evaluation of nutritional components of gonads of two kinds of sea urchin. Journal of Nutrition 40, 307–309 (2018).
Moreno-García, D. M. et al. Sea urchins: An update on their pharmacological properties. PeerJ 10, e13606 (2022).
Yang, Z. W. et al. Study on artificial breeding technique of Anthocidaris crassispina. Taiwan Strait 20, 32–36 (2001).
Feng, Y. Q., Xu, Z. J., Qin, R., Shen, M. H. & Zeng, G. Q. Study on artificial breeding technique of Anthocidaris crassispina. Marine science 30, 5–8 (2006).
Hibino, T. et al. The immune gene repertoire encoded in the purple sea urchin genome. Developmental biology 300, 349–365 (2006).
Rast, J. P., Smith, L. C., Loza-Coll, M., Hibino, T. & Litman, G. W. Genomic insights into the immune system of the sea urchin. Science 314, 952–956 (2006).
Kinjo, S., Kiyomoto, M., Yamamoto, T., Ikeo, K. & Yaguchi, S. HpBase: A genome database of a sea urchin, Hemicentrotus pulcherrimus. Development, Growth & Differentiation 60, 174–182 (2018).
Chakraborty, M., Baldwin-Brown, J. G., Long, A. D. & Emerson, J. Contiguous and accurate de novo assembly of metazoan genomes with modest long read coverage. Nucleic acids research 44, e147–e147 (2016).
Oluwadare, O., Highsmith, M. & Cheng, J. An overview of methods for reconstructing 3-D chromosome and genome structures from Hi-C data. Biological procedures online 21, 1–20 (2019).
Ay, F., Bailey, T. L. & Noble, W. S. Statistical confidence estimation for Hi-C data reveals regulatory chromatin contacts. Genome research 24, 999–1011 (2014).
Rao, S. S. et al. A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell 159, 1665–1680 (2014).
Burton, J. N. et al. Chromosome-scale scaffolding of de novo genome assemblies based on chromatin interactions. Nature biotechnology 31, 1119–1125 (2013).
Flynn, J. M. et al. RepeatModeler2 for automated genomic discovery of transposable element families. Proceedings of the National Academy of Sciences 117, 9451–9457 (2020).
Bao, Z. & Eddy, S. R. Automated de novo identification of repeat sequence families in sequenced genomes. Genome research 12, 1269–1276 (2002).
Price, A. L., Jones, N. C. & Pevzner, P. A. De novo identification of repeat families in large genomes. Bioinformatics 21, i351–i358 (2005).
Ou, S. & Jiang, N. LTR_retriever: a highly accurate and sensitive program for identification of long terminal repeat retrotransposons. Plant physiology 176, 1410–1422 (2018).
Ellinghaus, D., Kurtz, S. & Willhoeft, U. LTRharvest, an efficient and flexible software for de novo detection of LTR retrotransposons. BMC bioinformatics 9, 1–14 (2008).
Xu, Z. & Wang, H. LTR_FINDER: an efficient tool for the prediction of full-length LTR retrotransposons. Nucleic acids research 35, W265–W268 (2007).
Chen, N. Using Repeat Masker to identify repetitive elements in genomic sequences. Current protocols in bioinformatics 5, 4.10. 11–14.10. 14 (2004).
Beier, S., Thiel, T., Münch, T., Scholz, U. & Mascher, M. MISA-web: a web server for microsatellite prediction. Bioinformatics 33, 2583–2585 (2017).
Benson, G. Tandem repeats finder: a program to analyze DNA sequences. Nucleic acids research 27, 573–580 (1999).
Stanke, M., Diekhans, M., Baertsch, R. & Haussler, D. Using native and syntenically mapped cDNA alignments to improve de novo gene finding. Bioinformatics 24, 637–644 (2008).
Korf, I. Gene finding in novel genomes. BMC bioinformatics 5, 1–9 (2004).
Keilwagen, J. et al. Using intron position conservation for homology-based gene prediction. Nucleic acids research 44, e89–e89 (2016).
Kim, D., Langmead, B. & Salzberg, S. L. HISAT: a fast spliced aligner with low memory requirements. Nature methods 12, 357–360 (2015).
Pertea, M. et al. StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nature biotechnology 33, 290–295 (2015).
Tang, S., Lomsadze, A. & Borodovsky, M. Identification of protein coding regions in RNA transcripts. Nucleic acids research 43, e78–e78 (2015).
Grabherr, M. G. et al. Trinity: reconstructing a full-length transcriptome without a genome from RNA-Seq data. Nature biotechnology 29, 644 (2011).
Haas, B. J. et al. Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic acids research 31, 5654–5666 (2003).
Haas, B. J. et al. Automated eukaryotic gene structure annotation using EVidenceModeler and the Program to Assemble Spliced Alignments. Genome biology 9, 1–22 (2008).
Lowe, T. M. & Eddy, S. R. tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic acids research 25, 955–964 (1997).
Loman, T. A novel method for predicting ribosomal RNA genes in prokaryotic genomes. (2017).
Griffiths-Jones, S. et al. Rfam: annotating non-coding RNAs in complete genomes. Nucleic acids research 33, D121–D124 (2005).
Nawrocki, E. P. & Eddy, S. R. Infernal 1.1: 100-fold faster RNA homology searches. Bioinformatics 29, 2933–2935 (2013).
She, R., Chu, J. S.-C., Wang, K., Pei, J. & Chen, N. GenBlastA: enabling BLAST to identify homologous gene sequences. Genome research 19, 143–149 (2009).
Birney, E., Clamp, M. & Durbin, R. GeneWise and genomewise. Genome research 14, 988–995 (2004).
Huerta-Cepas, J. et al. eggNOG 5.0: a hierarchical, functionally and phylogenetically annotated orthology resource based on 5090 organisms and 2502 viruses. Nucleic acids research 47, D309–D314 (2019).
Kanehisa, M., Sato, Y., Kawashima, M., Furumichi, M. & Tanabe, M. KEGG as a reference resource for gene and protein annotation. Nucleic acids research 44, D457–D462 (2016).
Boeckmann, B. et al. The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic acids research 31, 365–370 (2003).
Finn, R. D. et al. Pfam: clans, web tools and services. Nucleic acids research 34, D247–D251 (2006).
NGDC Genome Sequence Archive (GSA). https://ngdc.cncb.ac.cn/gsa/browse/CRA014108 (2024)
Chen, T. et al. The Genome Sequence Archive Family: Toward Explosive Data Growth and Diverse Data Types. Genomics, proteomics & bioinformatics, https://doi.org/10.1016/j.gpb.2021.08.001 (2021).
CNCB-NGDC Members and Partners. Database Resources of the National Genomics Data Center, China National Center for Bioinformation in 2024. Nucleic acids research 52, D18–d32, https://doi.org/10.1093/nar/gkad1078 (2024).
Chen, M. et al. Genome Warehouse: a public repository housing genome-scale data. Genomics, Proteomics and Bioinformatics 19, 584–589 (2021).
Genome Warehouse(GWH) https://ngdc.cncb.ac.cn/gwh/Assembly/83691/show (2024).
Zhang, J. & Guo, Y. Genbank https://identifiers.org/ncbi/insdc.gca:GCA_040801975.1 (2024).
Parra, G., Bradnam, K. & Korf, I. CEGMA: a pipeline to accurately annotate core genes in eukaryotic genomes. Bioinformatics 23, 1061–1067 (2007).
Simão, F. A., Waterhouse, R. M., Ioannidis, P., Kriventseva, E. V. & Zdobnov, E. M. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics 31, 3210–3212 (2015).
Servant, N. et al. HiC-Pro: an optimized and flexible pipeline for Hi-C data processing. Genome biology 16, 1–11 (2015).
Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows–Wheeler transform. bioinformatics 25, 1754–1760 (2009).
Buchfink, B., Xie, C. & Huson, D. H. Fast and sensitive protein alignment using DIAMOND. Nature methods 12, 59–60 (2015).
NCBI BioProject: PRJNA827769. “Genome sequencing of Diadema setosum.” Available at: https://www.ncbi.nlm.nih.gov/bioproject/PRJNA827769.
Wang, Y. et al. MCScanX: a toolkit for detection and evolutionary analysis of gene synteny and collinearity. Nucleic acids research 40, e49–e49 (2012).
Tang, H., Krishnakumar, V., Li, J. & Zhang, X. jcvi: JCVI utility libraries. Zenodo 30, 2015 (2015).
Acknowledgements
This work was supported by the National Natural Science Foundation of China (No. 32160863), South Sea Fisheries Research Institute, Chinese Academy of Fishery Sciences (No. 2023XK01) and Sanya Yazhou Bay Science and Technology City (No. SKJC-2022-PTDX-014). We thank the anonymous reviewers for their helpful comments and suggestions.
Author information
Authors and Affiliations
Contributions
C.Q., Y.G. and J.Z. conceived the study. Y.G. and J.Z. collected the samples, extracted the genomic DNA, and conducted sequencing. J.S., Y.G., J.Z., G.Y. and Z.M. performed the bioinformatics analysis. Y.G. and J.Z. wrote the manuscript. All the authors have read and approved the final manuscript.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Zhang, J., Guo, Y., Su, J. et al. The first high-quality genome assembly and annotation of Anthocidaris crassispina. Sci Data 11, 866 (2024). https://doi.org/10.1038/s41597-024-03733-y
Received:
Accepted:
Published:
DOI: https://doi.org/10.1038/s41597-024-03733-y
- Springer Nature Limited