The first high-quality genome assembly and annotation of Anthocidaris crassispina

Zhang, Jia; Guo, Yu; Su, Jiaqi; Yu, Gang; Ma, Zhenhua; Qin, Chuanxin

doi:10.1038/s41597-024-03733-y

The first high-quality genome assembly and annotation of Anthocidaris crassispina

Data Descriptor
Open access
Published: 10 August 2024

Volume 11, article number 866, (2024)
Cite this article

Download PDF

You have full access to this open access article

Scientific Data

The first high-quality genome assembly and annotation of Anthocidaris crassispina

Download PDF

Jia Zhang^1,2,3^na1,
Yu Guo^1,2,3,4^na1,
Jiaqi Su^1,3,
Gang Yu^1,3,
Zhenhua Ma^1,3 &
…
Chuanxin Qin ORCID: orcid.org/0000-0002-3073-1563^1,2,3

467 Accesses
Explore all metrics

Abstract

Anthocidaris crassispina is a very popular edible sea urchin distributed along the coast of the South China Sea. In this study, we performed whole-genome sequencing and generated a chromosome-level assembly of this species. The total length of the genomic contig sequence was 891.02 Mb, and contig N50 was 808.15 kb when Hifiasm was used for assembly. The Hi-C library was constructed and sequenced, yielding approximately 68.61 Gb of data. After Hi-C assembly, approximately 886.72 Mb of sequence was able to be mapped onto 21 chromosomes, accounting for 99.52% of the total genome length. Among the sequences located on the chromosomes, those for which the order and direction could be determined accounted for approximately 826.82 Mb, or 93.24% of the total length. These results provide valuable resources for further study of A. crassispina at the genetic level.

A high-quality chromosome-level genome assembly of the topmouth culter (Culter alburnus Basilewsky, 1855)

Article Open access 22 August 2024

A high-quality chromosomal-level genome assembly of Greater Scaup (Aythya marila)

Article Open access 04 May 2023

Chromosome-level genome assembly of ridgetail white shrimp Exopalaemon carinicauda

Article Open access 04 June 2024

Background & Summary

Anthocidaris crassispina, known internationally as Heliocidaris crassispina, belongs to the genus Anthocidaris of Echinodermata, Echinoidea, Camarodonta, and Echinometridae and is commonly known as the fine-thorned sea urchin. Subtropical marine echinoderms are distributed on the eastern coast of China but only on the southern coast of Japan¹. Sea urchins have the advantages of a delicious gonad taste, high nutritional value², and high medicinal value³, among others, and their daily market demand is increasing. The wild sea urchin resources in natural sea areas have been overexploited and are declining daily, which has greatly promoted the development of the sea urchin breeding industry. At present, artificial breeding is the main mode of breeding for sea urchins⁴. Scholars at home and abroad have performed much research on food nutrition and breeding techniques in the process of artificial breeding of sea urchins⁵, but artificial breeding technology for A. crassispina is not as mature as that for other sea urchins. A. crassispina are dioecious and exhibit abnormal development during the growing period. Each individual metamorphoses into a young sea urchin from a planktonic larva, and it takes 1 to 2 years to achieve sexual maturity and reproduction. The artificial breeding of sea urchins is not only limited by technology but is also affected by the gonadal development of sea urchins. In the coastal areas of southeastern China, the most suitable breeding time for A. crassispina is from May to July, and the sea urchin industry is thus highly seasonal. Manually controlled sexual maturity could have important economic implications. Compared with other sea urchins^6,7,8, A. crassispina has been the focus of few studies, and we therefore know little about it. Due to the lack of complete genomic information, molecular research on A. crassispina is greatly hindered.

In this study, A. crassispina was used as the research material, and second- and third-generation sequencing techniques and Hi-C technology were used to assemble a high-quality A. crassispina genome. Based on this genome, we can analyse the genetic evolution of A. crassispina populations, confirm its intraspecific evolution and explore its transmission route and historical origin at home and abroad. The genome can also be used to preliminarily detect the gene regions that were altered during the historical domestication process and clarify the domestication and evolution of A. crassispina from a wild to cultivated species at the gene level. Moreover, the sites affected by domestication and the transcriptome can be jointly analysed to explore the gene regions related to the traits of A. crassispina and to provide important resources for species improvement. Therefore, the high-quality reference genome constructed in this study will aid in the exploration of key genome-specific favourable genes or variations in A. crassispina and provide new key gene resources for the improvement of A. crassispina. These findings can also be used for the development of molecular markers, gene mapping and mining, and molecular-assisted breeding at the whole-genome level and thus promote the molecular genetic improvement of A. crassispina.

Method

Sample information and collection

In this study, wild adult A. crassispina (Fig. 1a,b), which were caught locally in Shenzhen city, Guangdong Province, China, on April 29, 2023, were temporarily reared at the Shenzhen Experimental Base of the South China Sea Fisheries Research Institute, Chinese Academy of Fisheries Sciences (average temperature: 26 °C, dissolved oxygen: 6.45 mg/L, pH: 8.13, salinity: 31.42‰) for 10 days. We collected the gonads, intestines and tube feet of the individuals for genome sequencing and annotation. A total of two purple sea urchins were collected, one of which was used in the experiment, and the other was used as a spare sample. In this study, the collection of sea urchins was conducted in accordance with the guidelines and regulations established by the Animal Ethics and Utilization Committee of the South China Sea Fisheries Research Institute, Chinese Academy of Fishery Sciences, under approval number nhdf2023-16. All the samples were freshly frozen in liquid nitrogen and then stored at −80 °C until use.

DNA extraction, detection and sequencing

We used the SDS method to extract genomic DNA from gonadal tissue. The specific steps were as follows: the tissue sample was ground with liquid nitrogen and packed into a centrifuge tube, SDS solution was added, the mixture was incubated in a water bath, NaCl solution was added, the mixture was centrifuged, isoamyl alcohol (24:1) was added to the upper supernatant, the samples were centrifuged again, isopropanol was added to the upper supernatant, the samples were centrifuged a third time, the supernatant was discarded, 75% ethanol was added, the samples were centrifuged a fourth time, and the supernatant was discarded. After drying, TE was added to dissolve the DNA, and the DNA was stored in a freezer at −20 °C. The extracted DNA was detected using Nanodrop (manufacturer: Thermo Gene Company, model: NANODROP2000) and Qubit (manufacturer: Invitrogen, model: QubitTM3Flurometer) instruments, and its integrity was tested by performing agarose gel electrophoresis (electrophoresis instrument manufacturer: Tianneng Tanon, model: EPS600; electrophoresis tank manufacturer: Tiangen Biochemical Technology (Beijing) Co., Ltd., model: HE-120). A PacBio binding kit (PacBio, USA) was employed to ligate the library with primers (PacBio, USA) and polymerase (PacBio, USA). The resultant reaction products were subsequently purified using AMpure PB Beads (PacBio, USA) and sequenced on a Sequel II sequencer (PacBio, USA). The PacBio library was constructed using the genomic DNA from the samples, and approximately 17.96 Gb of clean data were obtained by sequencing. The total sequencing depth was approximately 25X, the read N50 was 18.38 kb, and the average read length was 17.12 kb. The clean data obtained by filtering out low-quality data contained a total of 1,049,106 reads, and the reads were counted according to different gradient length distributions (Table 1).

Table 1 Read length distribution statistics.

Full size table

Genome assembly

High-accuracy HiFi data were assembled using Hifiasm software (version 0.19)⁹. Hifiasm assembly can be divided into three main steps. The first step is to identify and correct haplotype errors. Although HiFi reads are very accurate, some errors remain. Hifiasm reads all HiFi reads into memory for all vs. all comparisons and error corrections. Based on the overlapping information between reads, if a base was present in a read that was different from other bases and had at least three supporting reads, it was considered a single nucleotide polymorphism (SNP) and was retained; otherwise, it was regarded as an error and corrected. Notably, Hifiasm uses only the same haplotype data for error correction to avoid overcorrection and retain heterozygous variation information from different haplotypes. In this step, Hifiasm can phase the hybrid SNPs. The second step is the construction of the assembly diagram. After correction, most of the errors are removed, and the heterozygous variation information is retained. Based on this information, Hifiasm constructs a phase string graph with reads as vertices and overlap regions as edges. In contrast to the general third-generation data assembly, the string graph generated via Hifiasm will retain all the bubbles, thus retaining all the haplotype information in the genome for subsequent haplotype processing. The third step is the generation of the assembly sequence. If no other data are available, Hifiasm will randomly select one side of each bubble to output the main assembly results (primary contigs). The final version of the genome was obtained by removing the assembled plastid sequence and applying Hi-C with BLAST alignment to the plastid library. The total length of the assembled genome contig sequence was 891.02 Mb, and the contig N50 was 808.15 Kb.

Hi-C library construction

Hi-C^10,11 is a technique for capturing the chromosomal conformation combined with high-throughput sequencing that mainly reconstructs the three-dimensional structure of chromosomes. Through the capture and sequencing of the interactions between all the DNA fragments in the chromosome, information on interactions between the segments of the genome is obtained, thus assisting in the assembly of the genome. The type of Hi-C used in the Hi-C library sequencing experiment was in situ Hi-C, which mainly includes cell cross-linking, endonuclease digestion, terminal end repair, cyclization, DNA purification and capture and computer sequencing¹². The specific operations are described below. (1) Cell cross-linking: Formaldehyde was used to fix the sample, and intracellular proteins were cross-linked with DNA, preserving their interactions and maintaining the intracellular 3D structure. (2) Endonuclease digestion: DNA was digested by restriction endonucleases to produce sticky ends on both sides of the cross-link. The most commonly used restriction endonuclease is DpnII. (3) End repair: Using the terminal repair mechanism, biotin-labelled bases were introduced to facilitate subsequent DNA purification and capture. (4) Cyclization of the DNA after terminal repair: Cy DNA fragments exhibiting interactions were cyclized to ensure that the location of the interacting DNA was determined during subsequent sequencing and analysis. (5) DNA purification and capture: Cross-links in DNA were reversed, and the DNA was purified and broken into 300 bp-700 bp fragments. The DNA fragments containing interactions were captured using streptavidin magnetic beads, and a library was constructed. After the construction of the library, the concentration of the library and the size of the inserted fragments were detected using the Qubit2.0 and Agilent2100 instruments, respectively, and the effective concentration of the library was quantified accurately by qPCR to ensure the quality of the library. After passing the library inspection, high-throughput sequencing was performed on the Illumina platform, and the sequencing read length was PE150. A Hi-C library was constructed and sequenced, and approximately 68.61 Gb of data were obtained.

Hi-C-assisted genome assembly

The genome sequence was divided, sequenced and oriented with LACHESIS¹³ software, and then the chromosome-level genome was obtained by manual mapping and inspection. First, Hi-C error correction was performed. Specifically, the contig version of the genome was interrupted into equal lengths (50 kb) and reassembled using Hi-C. The fragments that could not be restored to their location in the original assembly sequence were considered candidate error regions, and low Hi-C coverage in these regions indicated the error point, enabling complete error correction of the preliminary genome assembly. Second, Hi-C assembly was performed. Specifically, after error correction, the genome was assembled using LACHESIS software. After Hi-C assembly and manual adjustment of the heatmap, the genomic sequences of the common 886.72 Mb sequence were assigned to 21 chromosomes, accounting for 99.52% of the total length. Among the sequences located on chromosomes, the total length for which the order and direction could be determined was 826.82 Mb, accounting for 93.24% of the total length of the located chromosomal sequence. The detailed distribution of each chromosomal sequence is shown in Table 2.

Table 2 Hi-C assembly data.

Full size table

After the Hi-C error correction, auxiliary chromosome mounting and deredundancy steps, we obtained the final version of the genome for our project. The assembly results showed contig N50 of 0.81 Mb and scaffold N50 of 37.61 Mb. We used the circlize package to construct a loop map, which mainly shows the gene density, repeat sequence density, GC content and collinearity (Fig. 2).

Repeat sequence annotation

The repeat sequences included mainly tandem repeat sequences and scattered repeat sequences. The second type mainly included transposable elements (TEs), which were the main focus of our study. Because the conservation of repeat sequences among species is relatively low, a specific repeat sequence database must be built for the prediction of repeat sequences in a given species. We first used RepeatModeller2¹⁴ (v2.0.1), which includes two ab initio prediction software programs, RECON¹⁵ (v1.0.8) and RepeatScout¹⁶ (v1.0.6), for ab initio prediction and used the Repeat Classifier to classify the prediction results with the help of the Dfam (v3.5) database. Second, we used LTR_retriever¹⁷ (2.9.0) to make ab initio predictions of LTRs, which were mainly based on the prediction results from LTRharvest¹⁸ (v1.5.10) and LTR_FINDER¹⁹ (v1.07). Then, the above ab initio prediction results and the known database were combined to obtain the species-specific repeat sequence database. Finally, RepeatMasker²⁰ (v4.1.2) was used to predict the TEs in the genome based on the constructed repeat sequence database. Finally, a total TE sequence of approximately 306,462,727 bp was obtained, accounting for 34.39% of the genome sequence. The specific prediction results are shown in Table 3.

Table 3 Statistics for TE sequences.

Full size table

The tandem repeat sequences were mainly predicted using the MIcroSAtellite identification tool²¹ (MISA v2.1) and Tandem Repeat Finder²² (TRF, version 409, parameter: 2 7 7 80 10 50 500 -d -h). Finally, a tandem repeat sequence of approximately 67,813,157 bp was obtained, accounting for 7.61% of the total genome sequence. The specific prediction results are shown in Table 4 below.

Table 4 Statistical information on tandem repeat sequences.

Full size table

Coding gene prediction

Generally, three methods are used to predict genes: homology prediction, ab initio prediction and transcriptome prediction. Specifically, Augustus²³ (v3.1.0) and SNAP²⁴ (2006-07-28) were used for ab initio prediction, GeMoMa²⁵ (v1.7) was used for predictions based on homologous species, and the second-generation transcripts used for prediction mainly included transcripts assembled in three ways. One way was to use Hisat²⁶ (v2.1.0), Stringtie²⁷ (v2.1.4) and GeneMarkS-T²⁸ (v5.1) for gene prediction. The second step involved the use of Trinity²⁹ (v2.11) to assemble transcripts and then PASA³⁰ (v2.4.1) for gene prediction, and the third-generation transcriptional group approach involved the use of gmap (2020-06-30) for comparisons after a series of splicing sites and then using PASA (v2.4.1) for gene prediction. Finally, the prediction results from the above three methods were integrated by EVM³¹ (v1.1.1) and modified by PASA (v2.4.1). Finally, we obtained 28,966 genes and counted the number of genes integrated by EVM with support from the three prediction methods (Fig. 3). Most of the genes were supported by at least two prediction methods, indicating that the prediction quality was high.

Noncoding RNA prediction

Noncoding RNAs include microRNAs, rRNAs, tRNAs and RNAs with other known functions. According to the structural characteristics of different noncoding RNAs, different strategies were used to predict them; tRNAscan-SE³² (v1.3.1) was used to identify tRNAs, rRNAs were mainly predicted using barrnap³³ (v0.9), while miRNAs, snoRNAs and snRNAs were predicted using the Rfam³⁴ (v14.5) database and Infenal³⁵ (v1.1). A total of 8,855 tRNAs, 602 rRNAs, and 86 miRNAs were predicted.

Pseudogene annotation

Pseudogenes have sequences similar to those of functional genes but lose their original function due to insertions, deletions and other mutations. Through GenBlastA³⁶ (v1.0.4) alignment, homologous gene sequences (possible genes) were detected in the gene-masked genome, and then immature stop codons and frameshift mutations in the gene sequence were identified using GeneWise³⁷ (v2.4.1).

Gene function annotation

The predicted gene sequences were annotated against the NR, eggNOG³⁸, GO, KEGG³⁹, TrEMBL⁴⁰, KOG, SWISS-PROT⁴⁰ and Pfam⁴¹ databases. A total of 94.58% of the genes could be annotated against the databases, and the functional annotation statistics are shown in Table 5.

Table 5 Statistical information from gene function annotations.

Full size table

Version 5.0 of the Evolutionary Genealogy of Genes: Nonsupervised Orthologous Groups (eggNOG) database contains whole-genome protein sequences from 5090 organisms (477 eukaryotes, 4,445 representative bacteria and 168 archaea) and 2,502 viruses and is an extension of the NCBI COG/KOG database. eggNOG uses 20 functional categories introduced by COG, KOG and arCOG to classify genes at the functional level. In different functional categories, the number of genes can, to a certain extent, reflect the adaptation of the species to the environment, which can be combined with the biological traits or living environment of the research object to provide a scientific explanation. The statistical results of the eggNOG annotation classification are shown in Fig. 4.

Data Records

The raw sequence data reported in this paper have been deposited in the Genome Sequence Archive (Genomics, Proteomics & Bioinformatics 2021) of the National Genomics Data Center (Nucleic Acids Res 2022), China National Center for Bioinformation/Beijing Institute of Genomics, Chinese Academy of Sciences (GSA: CRA014108⁴²), which are publicly accessible at https://ngdc.cncb.ac.cn/gsa^43,44. The whole-genome sequence data reported in this paper have been deposited in the Genome Warehouse⁴⁵ of the National Genomics Data Center⁴⁴, Beijing Institute of Genomics, Chinese Academy of Sciences/China National Center for Bioinformation, under accession number GWHERBV00000000⁴⁶, which is publicly accessible at https://ngdc.cncb.ac.cn/gwh. The assembled genome of A. crassispina has been deposited in NCBI under accession PRJNA1128425⁴⁷.

Technical Validation

Evaluation of the genome assembly

Core Eukaryotic Genes Mapping Approach⁴⁸ (CEGMA) was used to evaluate conserved genes (458 genes) in eukaryotic model organisms to construct a core gene library, which was then combined with tBLASTn, GeneWise, geneid and other software to evaluate the integrity of the assembled genome. The CEGMA integrity of this biological sample was 91.48%.

BUSCOv5.2.1⁴⁹ was used to construct a set of single-copy genes representing several large evolutionary branches based on the linear homology database OrthoDB10. The gene set was compared with the assembled genome and evaluated according to the proportion and integrity of the alignment. The greater the integrity of the genome assembly was, the greater the proportion of complete BUSCOs detected. For nonhighly repetitive and polyploid genomes, the proportion of complete and duplicated BUSCOs detected should not be too high. The metazoan database OrthoDB10 was selected, and the integrity of the BUSCO evaluation was 96.96%. The short sequences obtained using second-generation high-throughput sequencing (such as Illumina sequencing) were compared with the assembled genome with BWA software to evaluate the integrity of the assembly and the uniformity of sequencing coverage. The integrity of the assembled genome and the uniformity of sequencing coverage can be evaluated by calculating the statistical comparison rate, the proportion of genome coverage and the depth distribution. The second-generation read return ratio was 98.80%, and the coverage was 99.69. The average sequencing depth was 35. The HiFi reads were compared with the assembled genome using Minimap2 software to evaluate the integrity of the assembly and the uniformity of sequencing coverage. The third-generation read match ratio was 99.03%, the coverage was 99.98, and the average sequencing depth was 17.

HiFi reads were compared to the assembly results to obtain the coverage depth of each site in the genome. Then, a window with a size of 10 kb was slid continuously along the genome without overlap (if the length of the sequence was less than 10 kb, the true length was selected), and the average sequencing depth (the sum of the sequencing depths of all sites in the window/the size of the window) and the percentage of GC content in the window were calculated. Finally, a density map of the contig GC content distribution vs. sequencing depth distribution was drawn according to the statistical data (Fig. 5).

Quality evaluation of the Hi-C library

HiC-Pro⁵⁰ (v2.10.0) is software that filters and evaluates Hi-C data. It can identify valid interaction pairs and invalid interaction pairs in the Hi-C sequencing results by analysing the comparison results and supports an evaluation of the Hi-C library quality. The reads compared to the specified assembled genome are called mapped reads, and the alignment efficiency refers to the percentage of mapped reads among the clean reads, which represents the utilization of the sequencing data. Comparison efficiency is affected not only by the quality of the data but also by the quality of the designated genome assembly. Using bwa⁵¹ (version: 0.7.17murr1188; alignment mode: other parameters of aln; default), the two-terminal sequencing data were compared with the sequences of the assembled genome. A total of 105,396,342 pairs of reads were obtained, of which 58,164,616 pairs were valid Hi-C data, accounting for 55.19% of the data aligned to the genome.

Evaluation of the Hi-C assembly results

We determined the grouping of chromosomes by cutting the genome of Hi-C assembled into chromosomes 300000 bp and a bin, and then the number of covered Hi-C ReadPairs between any two bins was used as a signal of the intensity of interaction between the two bins to construct a heatmap (Fig. 6). According to the heatmap, 21 chromosome groups can be clearly distinguished; within each group, the intensity of interaction in the diagonal position is greater than that in the nondiagonal position, indicating that the interaction intensity between adjacent sequences in the Hi-C assembly is high, while the strength of the interaction signal between nonadjacent sequences is weak, which is consistent with the principle of Hi-C-assisted genome assembly, which proves that the effect of genome assembly is better.

We used collinear analysis, specifically diamond⁵² (v0.9.29.130), as a method to compare the gene sequences of A. crassispina and Heliocidaris tuberculata⁵³ to determine similar gene pairs and to better verify the accuracy of the genome assembly. Then, according to the gff3 file, we determined whether similar gene pairs were adjacent on the chromosome. This process was performed mainly using MCScanX⁵⁴, and finally, all the genes in the collinear blocks were obtained. A collinear map of the linear patterns of the two species was drawn with JCVI⁵⁵ (v0.9.13) (Fig. 7).

Predictive evaluation of coding genes

The metazoan database in BUSCO⁴⁹ contains 925 complete BUSCOs, accounting for 96.96%. We used BUSCO (v5.2.2) software to evaluate the integrity of the gene predictions, which included 2 fragmented BUSCOs (0.21%) and 27 missing BUSCOs (2.83%), indicating that the integrity of the gene predictions was high.

Code availability

All commands and pipelines used for data processing were executed according to the manual and protocols of the corresponding bioinformatics software.

References

Luo, H. X. et al. Analysis of morphological differences of six wild populations of Anthocidaris crassispina in the South China Sea. Guangdong agricultural science 42, 114–119 (2015).
ADS Google Scholar
Xu, H. et al. Analysis and evaluation of nutritional components of gonads of two kinds of sea urchin. Journal of Nutrition 40, 307–309 (2018).
Google Scholar
Moreno-García, D. M. et al. Sea urchins: An update on their pharmacological properties. PeerJ 10, e13606 (2022).
Article PubMed PubMed Central Google Scholar
Yang, Z. W. et al. Study on artificial breeding technique of Anthocidaris crassispina. Taiwan Strait 20, 32–36 (2001).
Google Scholar
Feng, Y. Q., Xu, Z. J., Qin, R., Shen, M. H. & Zeng, G. Q. Study on artificial breeding technique of Anthocidaris crassispina. Marine science 30, 5–8 (2006).
Google Scholar
Hibino, T. et al. The immune gene repertoire encoded in the purple sea urchin genome. Developmental biology 300, 349–365 (2006).
Article CAS PubMed Google Scholar
Rast, J. P., Smith, L. C., Loza-Coll, M., Hibino, T. & Litman, G. W. Genomic insights into the immune system of the sea urchin. Science 314, 952–956 (2006).
Article ADS CAS PubMed PubMed Central Google Scholar
Kinjo, S., Kiyomoto, M., Yamamoto, T., Ikeo, K. & Yaguchi, S. HpBase: A genome database of a sea urchin, Hemicentrotus pulcherrimus. Development, Growth & Differentiation 60, 174–182 (2018).
Article CAS Google Scholar
Chakraborty, M., Baldwin-Brown, J. G., Long, A. D. & Emerson, J. Contiguous and accurate de novo assembly of metazoan genomes with modest long read coverage. Nucleic acids research 44, e147–e147 (2016).
PubMed PubMed Central Google Scholar
Oluwadare, O., Highsmith, M. & Cheng, J. An overview of methods for reconstructing 3-D chromosome and genome structures from Hi-C data. Biological procedures online 21, 1–20 (2019).
Article Google Scholar
Ay, F., Bailey, T. L. & Noble, W. S. Statistical confidence estimation for Hi-C data reveals regulatory chromatin contacts. Genome research 24, 999–1011 (2014).
Article CAS PubMed PubMed Central Google Scholar
Rao, S. S. et al. A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell 159, 1665–1680 (2014).
Article CAS PubMed PubMed Central Google Scholar
Burton, J. N. et al. Chromosome-scale scaffolding of de novo genome assemblies based on chromatin interactions. Nature biotechnology 31, 1119–1125 (2013).
Article CAS PubMed PubMed Central Google Scholar
Flynn, J. M. et al. RepeatModeler2 for automated genomic discovery of transposable element families. Proceedings of the National Academy of Sciences 117, 9451–9457 (2020).
Article ADS CAS Google Scholar
Bao, Z. & Eddy, S. R. Automated de novo identification of repeat sequence families in sequenced genomes. Genome research 12, 1269–1276 (2002).
Article CAS PubMed PubMed Central Google Scholar
Price, A. L., Jones, N. C. & Pevzner, P. A. De novo identification of repeat families in large genomes. Bioinformatics 21, i351–i358 (2005).
Article CAS PubMed Google Scholar
Ou, S. & Jiang, N. LTR_retriever: a highly accurate and sensitive program for identification of long terminal repeat retrotransposons. Plant physiology 176, 1410–1422 (2018).
Article CAS PubMed Google Scholar
Ellinghaus, D., Kurtz, S. & Willhoeft, U. LTRharvest, an efficient and flexible software for de novo detection of LTR retrotransposons. BMC bioinformatics 9, 1–14 (2008).
Article Google Scholar
Xu, Z. & Wang, H. LTR_FINDER: an efficient tool for the prediction of full-length LTR retrotransposons. Nucleic acids research 35, W265–W268 (2007).
Article PubMed PubMed Central Google Scholar
Chen, N. Using Repeat Masker to identify repetitive elements in genomic sequences. Current protocols in bioinformatics 5, 4.10. 11–14.10. 14 (2004).
Article Google Scholar
Beier, S., Thiel, T., Münch, T., Scholz, U. & Mascher, M. MISA-web: a web server for microsatellite prediction. Bioinformatics 33, 2583–2585 (2017).
Article CAS PubMed PubMed Central Google Scholar
Benson, G. Tandem repeats finder: a program to analyze DNA sequences. Nucleic acids research 27, 573–580 (1999).
Article CAS PubMed PubMed Central Google Scholar
Stanke, M., Diekhans, M., Baertsch, R. & Haussler, D. Using native and syntenically mapped cDNA alignments to improve de novo gene finding. Bioinformatics 24, 637–644 (2008).
Article CAS PubMed Google Scholar
Korf, I. Gene finding in novel genomes. BMC bioinformatics 5, 1–9 (2004).
Article Google Scholar
Keilwagen, J. et al. Using intron position conservation for homology-based gene prediction. Nucleic acids research 44, e89–e89 (2016).
Article PubMed PubMed Central Google Scholar
Kim, D., Langmead, B. & Salzberg, S. L. HISAT: a fast spliced aligner with low memory requirements. Nature methods 12, 357–360 (2015).
Article CAS PubMed PubMed Central Google Scholar
Pertea, M. et al. StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nature biotechnology 33, 290–295 (2015).
Article CAS PubMed PubMed Central Google Scholar
Tang, S., Lomsadze, A. & Borodovsky, M. Identification of protein coding regions in RNA transcripts. Nucleic acids research 43, e78–e78 (2015).
Article PubMed PubMed Central Google Scholar
Grabherr, M. G. et al. Trinity: reconstructing a full-length transcriptome without a genome from RNA-Seq data. Nature biotechnology 29, 644 (2011).
Article CAS PubMed PubMed Central Google Scholar
Haas, B. J. et al. Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic acids research 31, 5654–5666 (2003).
Article CAS PubMed PubMed Central Google Scholar
Haas, B. J. et al. Automated eukaryotic gene structure annotation using EVidenceModeler and the Program to Assemble Spliced Alignments. Genome biology 9, 1–22 (2008).
Article Google Scholar
Lowe, T. M. & Eddy, S. R. tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic acids research 25, 955–964 (1997).
Article CAS PubMed PubMed Central Google Scholar
Loman, T. A novel method for predicting ribosomal RNA genes in prokaryotic genomes. (2017).
Griffiths-Jones, S. et al. Rfam: annotating non-coding RNAs in complete genomes. Nucleic acids research 33, D121–D124 (2005).
Article CAS PubMed Google Scholar
Nawrocki, E. P. & Eddy, S. R. Infernal 1.1: 100-fold faster RNA homology searches. Bioinformatics 29, 2933–2935 (2013).
Article CAS PubMed PubMed Central Google Scholar
She, R., Chu, J. S.-C., Wang, K., Pei, J. & Chen, N. GenBlastA: enabling BLAST to identify homologous gene sequences. Genome research 19, 143–149 (2009).
Article CAS PubMed PubMed Central Google Scholar
Birney, E., Clamp, M. & Durbin, R. GeneWise and genomewise. Genome research 14, 988–995 (2004).
Article CAS PubMed PubMed Central Google Scholar
Huerta-Cepas, J. et al. eggNOG 5.0: a hierarchical, functionally and phylogenetically annotated orthology resource based on 5090 organisms and 2502 viruses. Nucleic acids research 47, D309–D314 (2019).
Article CAS PubMed Google Scholar
Kanehisa, M., Sato, Y., Kawashima, M., Furumichi, M. & Tanabe, M. KEGG as a reference resource for gene and protein annotation. Nucleic acids research 44, D457–D462 (2016).
Article CAS PubMed Google Scholar
Boeckmann, B. et al. The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic acids research 31, 365–370 (2003).
Article CAS PubMed PubMed Central Google Scholar
Finn, R. D. et al. Pfam: clans, web tools and services. Nucleic acids research 34, D247–D251 (2006).
Article CAS PubMed Google Scholar
NGDC Genome Sequence Archive (GSA). https://ngdc.cncb.ac.cn/gsa/browse/CRA014108 (2024)
Chen, T. et al. The Genome Sequence Archive Family: Toward Explosive Data Growth and Diverse Data Types. Genomics, proteomics & bioinformatics, https://doi.org/10.1016/j.gpb.2021.08.001 (2021).
CNCB-NGDC Members and Partners. Database Resources of the National Genomics Data Center, China National Center for Bioinformation in 2024. Nucleic acids research 52, D18–d32, https://doi.org/10.1093/nar/gkad1078 (2024).
Article Google Scholar
Chen, M. et al. Genome Warehouse: a public repository housing genome-scale data. Genomics, Proteomics and Bioinformatics 19, 584–589 (2021).
Article PubMed PubMed Central Google Scholar
Genome Warehouse(GWH) https://ngdc.cncb.ac.cn/gwh/Assembly/83691/show (2024).
Zhang, J. & Guo, Y. Genbank https://identifiers.org/ncbi/insdc.gca:GCA_040801975.1 (2024).
Parra, G., Bradnam, K. & Korf, I. CEGMA: a pipeline to accurately annotate core genes in eukaryotic genomes. Bioinformatics 23, 1061–1067 (2007).
Article CAS PubMed Google Scholar
Simão, F. A., Waterhouse, R. M., Ioannidis, P., Kriventseva, E. V. & Zdobnov, E. M. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics 31, 3210–3212 (2015).
Article PubMed Google Scholar
Servant, N. et al. HiC-Pro: an optimized and flexible pipeline for Hi-C data processing. Genome biology 16, 1–11 (2015).
Article Google Scholar
Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows–Wheeler transform. bioinformatics 25, 1754–1760 (2009).
Article CAS PubMed PubMed Central Google Scholar
Buchfink, B., Xie, C. & Huson, D. H. Fast and sensitive protein alignment using DIAMOND. Nature methods 12, 59–60 (2015).
Article CAS PubMed Google Scholar
NCBI BioProject: PRJNA827769. “Genome sequencing of Diadema setosum.” Available at: https://www.ncbi.nlm.nih.gov/bioproject/PRJNA827769.
Wang, Y. et al. MCScanX: a toolkit for detection and evolutionary analysis of gene synteny and collinearity. Nucleic acids research 40, e49–e49 (2012).
Article CAS PubMed PubMed Central Google Scholar
Tang, H., Krishnakumar, V., Li, J. & Zhang, X. jcvi: JCVI utility libraries. Zenodo 30, 2015 (2015).

Download references

Acknowledgements

This work was supported by the National Natural Science Foundation of China (No. 32160863), South Sea Fisheries Research Institute, Chinese Academy of Fishery Sciences (No. 2023XK01) and Sanya Yazhou Bay Science and Technology City (No. SKJC-2022-PTDX-014). We thank the anonymous reviewers for their helpful comments and suggestions.

Author information

These authors contributed equally: Jia Zhang, Yu Guo.

Authors and Affiliations

South China Sea Fisheries Research Institute, Chinese Academy of Fishery Sciences, Guangzhou, China
Jia Zhang, Yu Guo, Jiaqi Su, Gang Yu, Zhenhua Ma & Chuanxin Qin
National Agricultural Experimental Station for Fishery Resources and Environment Dapeng, Shenzhen, China
Jia Zhang, Yu Guo & Chuanxin Qin
Key Laboratory of Efficient Utilization and Processing of Marine Fishery Resources of Hainan Province, Sanya Tropical Fisheries Research Institute, Sanya, China
Jia Zhang, Yu Guo, Jiaqi Su, Gang Yu, Zhenhua Ma & Chuanxin Qin
Hainan Seed Industry Laboratory, Sanya, China
Yu Guo

Authors

Jia Zhang
View author publications
You can also search for this author in PubMed Google Scholar
Yu Guo
View author publications
You can also search for this author in PubMed Google Scholar
Jiaqi Su
View author publications
You can also search for this author in PubMed Google Scholar
Gang Yu
View author publications
You can also search for this author in PubMed Google Scholar
Zhenhua Ma
View author publications
You can also search for this author in PubMed Google Scholar
Chuanxin Qin
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

C.Q., Y.G. and J.Z. conceived the study. Y.G. and J.Z. collected the samples, extracted the genomic DNA, and conducted sequencing. J.S., Y.G., J.Z., G.Y. and Z.M. performed the bioinformatics analysis. Y.G. and J.Z. wrote the manuscript. All the authors have read and approved the final manuscript.

Corresponding author

Correspondence to Chuanxin Qin.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

Reprints and permissions

About this article

Cite this article

Zhang, J., Guo, Y., Su, J. et al. The first high-quality genome assembly and annotation of Anthocidaris crassispina. Sci Data 11, 866 (2024). https://doi.org/10.1038/s41597-024-03733-y

Download citation

Received: 08 March 2024
Accepted: 02 August 2024
Published: 10 August 2024
DOI: https://doi.org/10.1038/s41597-024-03733-y
Springer Nature Limited

The first high-quality genome assembly and annotation of Anthocidaris crassispina

Abstract

Similar content being viewed by others

A high-quality chromosome-level genome assembly of the topmouth culter (Culter alburnus Basilewsky, 1855)

A high-quality chromosomal-level genome assembly of Greater Scaup (Aythya marila)

Chromosome-level genome assembly of ridgetail white shrimp Exopalaemon carinicauda

Background & Summary