Two chromosome-level genomes of Smittia aterrima and Smittia pratorum (Diptera, Chironomidae)

Fu, Yue; Fang, Xiangliang; Xiao, Yunli; Mao, Bin; Xu, Zigang; Shen, Mi; Wang, Xinhua

doi:10.1038/s41597-024-03010-y

Two chromosome-level genomes of Smittia aterrima and Smittia pratorum (Diptera, Chironomidae)

Data Descriptor
Open access
Published: 03 February 2024

Volume 11, article number 165, (2024)
Cite this article

Download PDF

You have full access to this open access article

Scientific Data

Two chromosome-level genomes of Smittia aterrima and Smittia pratorum (Diptera, Chironomidae)

Download PDF

Yue Fu ORCID: orcid.org/0000-0002-4045-4352¹,
Xiangliang Fang¹,
Yunli Xiao¹,
Bin Mao ORCID: orcid.org/0000-0002-5332-7002¹,
Zigang Xu¹,
Mi Shen¹ &
…
Xinhua Wang²

1082 Accesses
Explore all metrics

Abstract

Chironomids are one of the most abundant aquatic insects and are widely distributed in various biological communities. However, the lack of high-quality genomes has hindered our ability to study the evolution and ecology of this group. Here, we used Nanopore long reads and Hi-C data to produce two chromosome-level genomes from mixed genomic data. The genomes of Smittia aterrima (SateA) and Smittia pratorum (SateB) were assembled into three chromosomes, with sizes of 78.45 Mb and 71.56 Mb, scaffold N50 lengths of 25.73 and 23.53 Mb, and BUSCO completeness of 98.5% and 97.8% (n = 1,367), 5.68 Mb (7.24%) and 1.94 Mb (2.72%) of repetitive elements, and predicted 12,330 (97.70% BUSCO completeness) and 11,250 (97.40%) protein-coding genes, respectively. These high-quality genomes will serve as valuable resources for comprehending the evolution and environmental adaptation of chironomids.

The chromosome-level genome assembly of the giant dobsonfly Acanthacorydalis orientalis (McLachlan, 1899)

Article Open access 08 April 2024

A high-quality chromosomal-level genome assembly of Greater Scaup (Aythya marila)

Article Open access 04 May 2023

Chromosome-level genome assembly of the pygmy grasshopper Eucriotettix oculatus (Orthoptera: Tetrigoidea)

Article Open access 26 April 2024

Background & Summary

The non-biting chironomid midges (Diptera: Chironomidae) have developed unique adaptations enabling them to endure and prosper in severe abiotic conditions, such as extreme temperatures, oxygen deprivation, pollution, high salinity, and varying pH levels, as well as complete desiccation (Henriques-Oliveira et al., 2003; Osbourne et al., 2000; Vos et al., 2002; Pinder, 1986). As remarkably resilient aquatic organisms, research has indicated that chironomid survival in diverse water habitats necessitates specific detoxification enzymes^1,2,3,4.

To gain a better understanding of the fundamental mechanisms of tolerance, researchers have increasingly focused on the genomes of organisms capable of thriving in extreme environments. Three notable examples include the chironomid midges Polypedilum vanderplanki, which can survive near complete water loss in its larval form, Belgica antarctica, which exhibits remarkable freeze tolerance, and Propsilocerus akamusi, a species recognized for its ability to tolerate pollution^5,6,7. Such genomes had shaped by their environment and the adaptations they undergo to survive and reproduce. The Orthocladiinae, which is the largest subfamily within Chironomidae, is characterized by a remarkable diversity of species that have evolved varied ecological and physiological adaptations to their respective environments. Among the Orthocladiinae, the genus Smittia, encompassing species such as S. aterrima and S. pratorum, was notable for its thriving in littoral environments owing to their unique terrestrial/amphibious way of life, as the majority of chironomid larvae are aquatic^8,9,10. Besides, research on Smittia sp. has been instrumental in advancing our understanding of parthenogenesis, polytene chromosomes, embryonic development, and nucleolar RNA synthesis, demonstrating significant scientific importance^{11,12,13,14,15,16}.

With the multiple purposes of generating genomic resources to investigate chironomid genome evolution, chromosomal composition, and tolerant specialization, we utilized Oxford Nanopore Technologies (ONT) and high-throughput chromosome conformation capture (Hi-C) sequencing to produce two chromosome-level reference genomes for S. aterrima and S. pratorum. We performed genome assembly, and annotation, and conducted evolutionary analyses. Our work generates a valuable chromosome-level genomic resource for chironomids, establishing a foundation for future research into the environmental tolerance mechanisms of these insects.

Methods

Sample collection and sequencing

Live male adult specimens of S. aterrima and S. pratorum were collected at Baitan Lake (Huanggang City, Hubei Province, China, 30.463250°N, 114.942184°E). After sample collection, the whole bodies of samples were immediately immersed into liquid nitrogen and stored at −80 °C. There were 300 male adults (about 200 male adults of S. aterrima and 100 male adults of S. pratorum were mixed together) used for genome sequencing.

DNA was extracted using the 1D DNA Ligation Sequencing kit SQK-LSK109. RNA was extracted with the TRIzol™ Reagent kit. After the determination of the DNA quality and quantity, a paired-end sequencing library (350 bp in length) was constructed and sequenced using the Beijing Genomics Institute (BGI), and the library construction was completed by Berry Genomic Corporation (Beijing, China). In addition, a Single Molecule Real-Time DNA library was prepared for sequencing using SQK-LSK109 Kit with an insert size of 30 kb. The Oxford Nanopore third-generation sequencing was completed by BenaGen Corporation in Wuhan, China. The RNA library was constructed with Illumina TruSeq RNA v2 Kit according to the manufacturer’s instructions, and the three-generation full-length (ONT) RNA was extracted by (DP441) RNA prep Pure Plant Plus Kit to construct an ONT PromethION library, was completed by Berry Genomic Corporation in Beijing, China. Hi-C libraries were constructed according to the improved Hi-C procedures¹⁷. including cross-linking of formaldehyde, restriction enzyme digestion, ends repair of fragments, DNA cyclization, DNA purification, and other steps with MboI as the restriction enzyme. Finally, we obtained 93.95 Gb of sequencing data, comprising 28.11 Gb of Illumina reads, 21.16 Gb of Nanopore reads, 13.40 Gb of Hi-C data, and 31.28 Gb of RNA data, which consisted of 21.78 Gb of Illumina sequencing and 9.50 Gb of ONT sequencing. The mean/N50 lengths of the Nanopore and ONT reads were 6.01/21.65 kb and 0.99/1.41 kb, respectively (Table 1). The 28.11 Gb Illumina DNA data was retained after the quality control process and then used for the genome survey. The k-mer (k = 21) analysis demonstrated that the genomes with a low heterozygous ranging from 0.70%‐0.85% (Fig. 1a, Table 2), and the estimated size was about 83.52‐84.51 Mb.

Table 1 Raw data from the different sequencing.

Full size table

Table 2 Results of the Suvery analysis.

Full size table

Genome size estimation and assembly

Quality control of the BGI data was carried out by BBTools v38.82¹⁸: “clumpify.sh” is used to remove repeats; “bbduk.sh” is used for specific quality control, i. e. removing sites with a base mass score below 20 (>Q20), filtering sequences less than 15 bp, removing poly-A/G/C ends over 10 bp, and correcting bases using the overlap region (overlapping reads). The k-mer frequencies were assessed using “khist.sh” (BBTools) with a length set to 21 k-mer. The k-mer analysis was then performed using GenomeScope v2.0¹⁹, with a maximum k-mer coverage of 1,000 (“-m 1000”).

For genomic contig assembly, the ONT raw data were error-corrected by NextDenovo v2.5.0 (https://github.com/Nextomics/NextDenovo), filtered for contaminated sequences using Kraken v2.1.2²⁰, and then assembled using NextDenovo software with parameters read_cutoff = 1k. Sequences below 1 kb in the raw data were filtered. One round of long sequence correction using Inspector v1.2²¹ and two rounds of short sequence correction with NextPolish v1.3.0²² to obtain the corrected genome sequence and further improve the assembly accuracy (Table 3). Minimap2 v.2.17²³ was employed as the read Fundamental mapper during long and short-read polishing stages.

Table 3 Result statistics of the genome assembly.

Full size table

In order to obtain clean data, the adapter sequences of raw reads were trimmed and low-quality reads were removed using Juicer v1.6.2²⁴. Subsequently, the clean reads were mapped to the draft genome into the chromosome using 3D-DNA. Juicebox v1.11.08²⁴ was used to correct possible errors (such as misjoins, translocations, and inversions) in the candidate assembly by visualizing Hi-C heatmaps. Judging from the Hi-C heatmap, information on both species (S. aterrima and S. pratorum) was obtained simultaneously (Table 4, Fig. 1b). Possible contaminants were detected using MMseqs. 2 v11²⁵, which performed BLASTN-like searches based on the NCBI nucleotide (nt) and UniVec databases. The completeness of the genome was evaluated using BUSCO v3.0.2²⁶ with insecta_odb10 dataset (n = 1,367 single-copy orthologues) and BUSCO v5.4.4²⁷ with diptera_odb10 dataset (n = 3,285 single-copy orthologues). To calculate the mapping rate, we mapped ONT long reads and BGI short reads to the assembly using Minimap2. We then calculated the mapping rate using SAMtools v.1.10²⁸ with the ‘flagstat’ parameter. Finally, the genomes of S. aterrima and S. pratorum were assembled into three chromosomes with sizes of 78.45 Mb and 71.56 Mb, the scaffold N50 lengths evaluated with insecta_odb10 dataset were 25.73 Mb and 23.53 Mb, while the GC content was 36.93% and 41.72%, respectively. (Table 5). The results evaluated with diptera_odb10 dataset in Table 6.

Table 4 The Chromosome length and sequencing coverage of S. aterrima (SateA) and S. pratorum (SateB).

Full size table

Table 5 Genome assembly of S. aterrima (SateA) and S. pratorum (SateB) with insecta_odb10 dataset.

Full size table

Table 6 Genome assembly of S. aterrima (SateA) and S. pratorum (SateB) with diptera_odb10 dataset.

Full size table

Genome annotation

Genomes are often annotated with repeat sequences, protein-coding genes, and non-coding RNA.

We used the software RepeatModeler v2.0.2a²⁹ with an LTR discovery pipeline (-LTRStruct) to construct a repeat DNA library. Then the Dfam 3.3³⁰ and RepBase-20181026³¹ databases were merged into a custom library, and finally the software RepeatMasker v4.1.2p1³² with the default commands was used to predict the repeat sequence according to the custom library. The genomes of S. aterrima and S. pratorum produced a total of 27,699 repeats (5.68 Mb) and 15,775 repeats (1.94 Mb), respectively, resulting in a repeat sequence ratio of 7.24% and 2.72%. The five most prevalent classes of repeat sequences were unknown (4.40% and 1.25%), LTR elements (1.13% and 0.62%), DNA elements (0.68% and 0.23%), Simple repeats (0.44% and 0.20%), and LINEs (0.30% and 0.19%). Statistical results are shown in Tables S1, S2.

The protein-coding genes were annotated by integrating the evidence of ab initio, transcriptome-based prediction, and homology-based annotations. The protein coding gene structures were predicted using MAKER v3.01.03 with the default commands³³. For the predictions of ab initio, BRAKER v2.1.6³⁴ and GeMoMa v1.8³⁵ were used to integrate the transcriptomic and protein evidence and to integrate the predicted results of both as the input file for MAKER ab initio (ab. gff3). The transcriptome was aligned with the RNA-seq data to the genome by HISAT2 v2.2.0³⁶ to generate BAM files. Augustus v3.3.4³⁷ and GeneMark-ES/ET/EP 4.68_3.60_lic³⁸ were automatically trained by BRAKER³⁹, and integrate arthropod protein sequences (OrthoDB10 v1 database⁴⁰) to improve the prediction accuracy. We used RNA-seq alignments produced from HISAT2 to perform genome-guided assembly by StringTie v2.1.6⁴¹. For the homology-based approach, GeMoMa with GeMoMa. c = 0.4 GeMoMa. p = 10 parameter was used to perform the annotation of protein-coding based on the annotation of genes of Anopheles arabiensis (GCF_016920715.1), Bradysia coprophila (GCF_014529535.1), Culex quinquefasciatus (GCF_015732765.1), Drosophila melanogaster (GCF_000001215.4), and Hermetia illucens (GCF_905115235.1) from GenBank. Finally, we predicted a total of 12,330 and 11,250 protein-coding genes in S. aterrima and S. pratorum, respectively. These genes had an average of 5.2/5.2 exons per gene, with an average exon length of 429.3/414.5 bp, and an average of 4.1/4.1 introns per gene, with an average intron length of 416.7/414.5 bp. Further, each gene contained 4.9/5.0 CDS, with an average CDS length of 344.4/342.8 bp. BUSCO completeness of the protein sequences was 97.7%/97.4% (n = 1,367), including 75.4%/74.4% single-copy, 22.3%/23.0% duplicated, 0.1%/0.4% fragmented, and 2.2%/2.2% missing BUSCOs, suggesting high-quality predictions.

Non-coding RNAs including transfer RNAs (tRNAs), microRNAs (miRNAs), ribosome RNAs (rRNAs), and small nuclear RNAs (snRNAs) were also identified. The rRNAs, snRNAs, and miRNAs were detected from the Rfam database (release 13.0)⁴² using Infernal v1.1.4⁴³. The tRNAs were predicted using tRNAscan-SE v2.0.9⁴⁴ with the script “EukHighConfidenceFilter”. The rRNAs and subunits were predicted using RNAmmer v1.2⁴⁵. We identified a total of 273 and 273 noncoding RNA sequences were annotated for S. aterrima and S. pratorum. These included 35 and 34 microRNAs (miRNAs), 26 and 42 ribosomal RNAs (rRNAs), 26 and 20 small nucleolar RNAs (snRNAs), 137 and 123 tRNAs, and 45 and 50 other RNA sequences, respectively. The snRNAs identified included 15 and 9 spliceosomal RNAs (U1, U2, U4, U5, U6), 10 C/D box snoRNAs, and 1 HACA-box snoRNA in each species, respectively (Tables S3, S4).

Two strategies were used for the annotation of gene functions. We conducted the gene functional annotation search against the UniProtKB (SwissProt + TrEMBL)⁴⁶ and the nonredundant protein sequence database (NR) using the sensitive mode of Diamond v2.0.11.149⁴⁷ in sensitive mode with the parameters “–very-sensitive -e 1e-5”. We further employed eggNOG-mapper v2.1.5⁴⁸ and InterProScan 5.53‐87.0⁴⁹ to assign Gene Ontology (GO), Kyoto Encyclopedia of Genes and Genomes (KEGG) and Reactome pathway annotations and to identify protein domains. Four databases including protein families (Pfam)⁵⁰, SMART⁵¹, Superfamily⁵², CDD⁵³ were searched by InterProScan. The results predicted by the above tools were integrated to obtain the final prediction of gene functions. For S. aterrima and S. pratorum, a high percentage of annotated genes matched the UniProtKB database, with 11,740 (95.21%) and 10,835 (96.31%) genes respectively. The InterProScan database identified protein domains in 9,419/8,811 protein-coding genes, while 10,135/9,447 GO and 4,746/4,491 KEGG were identified by InterProScan and eggNOG-mapper. Furthermore, 7140/6695 genes were annotated as GO terms, 7673/7202 as KEGG ko terms, 2749/2614 as Enzyme Codes, 4746/4491 as KEGG pathways, 9419/8811 as Reactome pathways, and 10762/10047 as COG functional categories (Table 7).

Table 7 Genome annotation and functional annotation results of S. aterrima and S. pratorum.

Full size table

Data Records

All raw sequencing data and the genome assembly of S. aterrima and S. pratorum underlying this article are available at the NCBI and can be accessed with Bioproject ID PRJNA809421. The Nanopore, Illumina, Hi-C, and transcriptome data can be found under identifcation numbers SRR23797681-SRR23797685^{54,55,56,57,58}. The assembled genome has been deposited in the NCBI assembly with the accession number GCA_033063855.1⁵⁹ and GCA_033064975.1⁶⁰. All annotations data for repeated sequences, gene structure, and functional prediction are available for download through Figshare⁶¹ (https://doi.org/10.6084/m9.figshare.22762118).

Technical Validation

Two genome assembly methods, NextDenovo, and 3D-DNA, were executed and subsequently compared (Table 3). The NextDenovo assembly genome was slightly larger than predicted at 151.48 Mb and contained 78 primary contigs, with an N50 contig length of 3.39 Mb, the longest contig of 9.63 Mb, and 98.40% of BUSCO genes. The average GC content was 39.16%. After haplotig purging and genome polishing, the genome length was 151.31 Mb, contained 78 primary contigs with an N50 contig length of 3.38 Mb, the longest contig of 9.62 Mb, and 99.10% of BUSCO genes. The 3D-DNA assembly showed a remarkable improvement, yielding 89 scaffolds with a scaffold N50 length of 25.73 Mb, including the longest scaffold of 29.79 Mb. Additionally, the BUSCO completeness percentage achieved in the 3D-DNA version was 99.2%, with negligible levels of fragmentation and missing genes at 0.00% and 0.80% respectively. After utilizing Hi-C long-range scaffolding to enhance our assembly, we securely anchored the assembled scaffolds into six chromosomes that can be categorized into two species; S. aterrima (SateA) and S. pratorum (SateB) (Fig. 1b). Subsequently, we imported the 3D-DNA assembly to obtain the final results of the chromosome assembly.

The final genome assembly showed a BUSCO completeness of 98.5% and 97.8% (n = 1,367), and duplicated BUSCOs were 1.3% and 0.9%, respectively. Additionally, the high mapping ratios of both BGI and ONT data were 93.38% and 92.35%, respectively. These indicators collectively demonstrate that the assembly has achieved a remarkable degree of continuity and integrity (Table 5).

Code availability

The versions of software used were included in the methods section. No specific script was utilized for this work. All commands and pipelines employed in data processing were executed in accordance with the manual and protocols of the corresponding bioinformatic software.

References

Andersen, T., Baranov, V. & Hagenlund, L. K. Blind Flight? A New Troglobiotic Orthoclad (Diptera, Chironomidae) from the Lukina Jama‐Trojama Cave in Croatia. PloS One. 11, e0152884 (2016).
Article PubMed PubMed Central Google Scholar
Londoño, D. K., Siegfried, B. D. & Lydy, M. J. Atrazine induction of a family 4 cytochrome P450 gene in Chironomus tentans (Diptera: Chironomidae). Chemosphere. 56, 701–706 (2004).
Article PubMed ADS Google Scholar
Londoño, D. K. et al. Cloning and expression of an atrazine inducible cytochrome P450, CYP4G33, from Chironomus tentans (Diptera: Chironomidae). Pestic. Biochem. Physiol. 89, 104–110 (2007).
Article Google Scholar
Sun, Z., Liu, Y., Xu, H. & Yan, C. Genome-Wide Identification of P450 Genes in Chironomid Propsilocerus akamusi Reveals Candidate Genes Involved in Gut Microbiota-Mediated Detoxification of Chlorpyrifos. Insects. 13, 765 (2022).
Article PubMed PubMed Central Google Scholar
Gusev, O. et al. Comparative genome sequencing reveals genomic signature of extreme desiccation tolerance in the anhydrobiotic midge. Nat. Commun. 5, 4784 (2014).
Article CAS PubMed ADS Google Scholar
Shaikhutdinov, N. & Gusev, O. Chironomid midges (Diptera) provide insights into genome evolution in extreme environments. Curr Opin Insect Sci. 49, 101–107 (2022).
Article PubMed Google Scholar
Sun, X. et al. A chromosome level genome assembly of Propsilocerus akamusi to understand its response to heavy metal exposure. Mol. Ecol. Resour. 21, 1996–2012 (2021).
Article CAS PubMed Google Scholar
Cranston, P. S., Oliver, D. R. & Sæther, O. A. in Chironomidae of Holarctic region. Keys and diagnoses (ed. Wiederholm, T.) Part 1. Larvae. (Ent. Scand. Suppl. 19, 1983).
Delettre, Y. R. Short-range spatial patterning of terrestrial Chironomidae (Insecta: Diptera) and farmland heterogeneity. Pedobiologia. 49, 15–27 (2005).
Article Google Scholar
Frouz, J. The effect of vegetation patterns on oviposition habitat preference: A driving mechanism in terrestrial chironomid (Diptera: Chironomidae) succession? Res Popul Ecol. 39, 207–213 (1997).
Article Google Scholar
Brown, P. M. & Kalthoff, K. Inhibition by ultraviolet light of pole cell formation in Smittia sp (Chironomidae, Diptera): Action spectrum and photoreversibility. Dev. Biol. 97, 113–122 (1983).
Article CAS PubMed Google Scholar
Hägele, K. Studies on polytene chromosomes of Smittia parthenogenetica (Chironomidae, Diptera). Chromosoma. 76, 47–55 (1980).
Article Google Scholar
Jacob, J. An electron microscope autoradiographic study of the site of initial synthesis of RNA in the nucleolus of Smittia. Exp. Cell Res. 48, 276–282 (1967).
Article CAS PubMed Google Scholar
Jäckle, H. & Kalthoff, K. Proteins foretelling head and abdomen development in the embryo of Smittia spec. (Chironomidae, Diptera). Dev. Biol. 85, 287–298 (1981).
Article PubMed Google Scholar
Kalthoff, K., Ran, K.-G. & Edmond, J. C. Modifying effects of UV irradiation on the development of abnormal body patterns in centrifuged insect embryos (Smittia spec., Chironomidae, Diptera). Dev. Biol. 91, 413–422 (1982).
Article CAS PubMed Google Scholar
Ripley, S. & Kalthoff, K. Changes in the apparent localization of anterior determinants during early embryogenesis (Smittia spec., Chironomidae, Diptera). Wilhelm Roux’s Arch. Dev. Biol. 192, 353–361 (1983).
Article Google Scholar
Rao, S. S. P. et al. A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell 159, 1665–1680 (2014).
Article CAS PubMed PubMed Central Google Scholar
Bushnell, B. BBtools. Retrieved from https://sourceforge.net/projects/bbmap/ (2014).
Ranallo-Benavidez, T. R., Jaron, K. S. & Schatz, M. C. GenomeScope 2.0 and Smudgeplot for reference-free profiling of polyploid genomes. Nat. Commun. 11, 1432 (2020).
Article CAS PubMed PubMed Central ADS Google Scholar
Wood, D. E., Lu, J. & Langmead, B. Improved metagenomic analysis with Kraken 2. Genome Biol. 20, 257 (2019).
Article CAS PubMed PubMed Central Google Scholar
Chen, Y., Zhang, Y. X., Wang, A. Y., Gao, M. & Chong, Z. C. Accurate long-read de novo assembly evaluation with Inspector. Genome Biol. 22, 312 (2021).
Article PubMed PubMed Central Google Scholar
Hu, J., Fan, J., Sun, Z. Y., Liu, S. L. & Berger, B. NextPolish: a fast and efficient genome polishing tool for long read assembly. Bioinformatics. 36, 2253–2255 (2020).
Article CAS PubMed Google Scholar
Li, H. Minimap2: pairwise alignment for nucleotide sequences Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics. 34, 3094–3100 (2018).
Article CAS PubMed PubMed Central Google Scholar
Durand, N. C. et al. Juicer provides a one-click system for analyzing loop-resolution Hi-C experiments. Cell Syst. 3, 95–98 (2016).
Article CAS PubMed PubMed Central Google Scholar
Steinegger, M. & Söding, J. MMseqs. 2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat. Biotechnol. 35, 1026–1028 (2017).
Article CAS PubMed Google Scholar
Waterhouse, R. M. et al. BUSCO applications from quality assessments to gene prediction and phylogenomics. Mol. Biol. Evol. 35, 543–548 (2018).
Article CAS PubMed Google Scholar
Manni, M., Berkeley, M. R., Seppey, M., Simao, F. A. & Zdobnov, E. M. BUSCO update: Novel and streamlined workflows along with broader and deeper phylogenetic coverage for scoring of eukaryotic, prokaryotic, and viral genomes. Mol Biol Evol. 38, 4647–4654 (2021).
Article CAS PubMed PubMed Central Google Scholar
Li, H. et al. The Sequence Alignment/Map Format and SAMtools. Bioinformatics. 25, 2078–2079 (2009).
Article PubMed PubMed Central Google Scholar
Flynn, J. M. et al. RepeatModeler2 for automated genomic discovery of transposable element families. Proc. Natl. Acad. Sci. 117, 9451–9457 (2020).
Article CAS PubMed PubMed Central ADS Google Scholar
Hubley, R. et al. The Dfam database of repetitive DNA families. Nucleic Acids Res. 44, D81–D89 (2016).
Article CAS PubMed Google Scholar
Bao, W., Kojima, K. K. & Kohany, O. Repbase Update, a database of repetitive elements in eukaryotic genomes. Mobile DNA. 6, 1–6 (2015).
Article Google Scholar
Smit, A. F. A., Hubley, R., & Green, P. RepeatMasker Open-4.0. Available online: http://www.repeatmasker.org (accessed on 1 October 2022) (2013‐2015).
Holt, C. & Yandell, M. MAKER2: an annotation pipeline and genome-database management tool for second-generation genome projects. BMC Bioinf. 12, 491 (2011).
Article Google Scholar
Hoff, K. J., Lange, S., Lomsadze, A., Borodovsky, M. & Stanke, M. BRAKER1: unsupervised RNA-Seq-Based genome annotation with GeneMark-ET and AUGUSTUS. Bioinformatics. 32, 767–769 (2016).
Article CAS PubMed Google Scholar
Keilwagen, J., Hartung, F. & Grau, J. GeMoMa: Homology-Based Gene Prediction Utilizing Intron Position Conservation and RNA-seq Data. Methods Mol. Biol. 1962, 161–177 (2019).
Article CAS PubMed Google Scholar
Kim, D., Langmead, B. & Salzberg, S. L. HISAT: A fast spliced aligner with low memory requirements. Nat. Methods. 12, 357–360 (2015).
Article CAS PubMed PubMed Central Google Scholar
Stanke, M., Steinkamp, R., Waack, S. & Morgenstern, B. AUGUSTUS: a web server for gene finding in eukaryotes. Nucleic Acids Res. 32, W309–W312 (2004).
Article CAS PubMed PubMed Central Google Scholar
Brůna, T., Lomsadze, A. & Borodovsky, M. GeneMark-EP+: Eukaryotic gene prediction with self-training in the space of genes and proteins. Nar Genomics Bioinf. 2, lqaa26 (2020).
Article Google Scholar
Tomas, B., Katharina, J. H., Alexandre, L., Mario, S. & Mark, B. BRAKER2: Automatic eukaryotic genome annotation with GeneMark- EP+ and AUGUSTUS supported by a protein database. Nar Genomics Bioinf. 3, lqaa108 (2021).
Article Google Scholar
Kriventseva, E. V. et al. OrthoDB v10: Sampling the diversity of animal, plant, fungal, protist, bacterial and viral genomes for evolutionary and functional annotations of orthologs. Nucleic Acids Res. 47, D807–D811 (2019).
Article CAS PubMed Google Scholar
Kovaka, S. et al. Transcriptome assembly from long-read RNA-seq alignments with StringTie2. Genome Biol. 20, 278 (2019).
Article CAS PubMed PubMed Central Google Scholar
Kalvari, I. et al. Rfam 13.0: Shifting to a genome-centric resource for non-coding RNA families. Nucleic Acids Res. 46, D335–D342 (2018).
Article CAS PubMed Google Scholar
Nawrocki, E. P. & Eddy, S. R. Infernal 1.1: 100-fold faster RNA homology searches. Bioinformatics. 29, 2933–2935 (2013).
Article CAS PubMed PubMed Central Google Scholar
Chan, P. P. & Lowe, T. M. TRNAscan-SE: Searching for tRNA genes in genomic sequences. Methods Mol Biol. 1962, 1–14 (2019).
Article CAS PubMed PubMed Central Google Scholar
Lagesen, K. et al. RNAmmer: Consistent and rapid annotation of ribosomal RNA genes. Nucleic Acids Res. 35, 3100–3108 (2007).
Article CAS PubMed PubMed Central ADS Google Scholar
Morgat, A. et al. Enzyme annotation in UniProtKB using Rhea. Bioinformatics 36, 1896–1901 (2020).
Article CAS PubMed Google Scholar
Buchfink, B., Reuter, K. & Drost, H. Sensitive protein alignments at tree-of-life scale using DIAMOND. Nat. Methods. 18, 366–368 (2021).
Article CAS PubMed PubMed Central Google Scholar
Huerta-Cepas, J. et al. Fast genome-wide functional annotation through orthology assignment by eggNOG-mapper. Mol. Biol. Evol. 34, 2115–2122 (2017).
Article CAS PubMed PubMed Central Google Scholar
Finn, R. D. et al. InterPro in 2017—beyond protein family and domain annotations. Nucleic Acids Res. 45, D190–D199 (2017).
Article CAS PubMed Google Scholar
El-Gebali, S. et al. The Pfam protein families database in 2019. Nucleic Acids Res. 47, D427–D432 (2019).
Article CAS PubMed Google Scholar
Letunic, I. & Bork, P. 20 years of the SMART protein domain annotation resource. Nucleic Acids Res. 46, D493–D496 (2018).
Article CAS PubMed Google Scholar
Wilson, D. et al. SUPERFAMILY—sophisticated comparative genomics, data mining, visualization and phylogeny. Nucleic Acids Res. 37, D380–D386 (2009).
Article CAS PubMed Google Scholar
Marchler-Bauer, A. et al. CDD/SPARCLE: Functional classification of proteins via subfamily domain architectures. Nucleic Acids Res. 45, D200–D203 (2017).
Article CAS PubMed Google Scholar
NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRR23797681 (2023).
NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRR23797682 (2023).
NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRR23797683 (2023).
NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRR23797684 (2023).
NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRR23797685 (2023).
NCBI Assembly https://identifiers.org/ncbi/insdc.gca:GCA_033063855.1 (Smittia aterrima) (2023).
NCBI Assembly https://identifiers.org/ncbi/insdc.gca:GCA_033064975.1 (Smittia pratorum) (2023).
Fu, Y. Genome assembly and annotations of Smittia aterrima and Smittia pratorum (Diptera, Chironomidae). figshare https://doi.org/10.6084/m9.figshare.22762118 (2023).

Download references

Acknowledgements

This study was supported by the National Natural Science Foundation of China (NSFC) (Grant No. 32070483, 31460572), Natural Science Foundation of Hubei Province (Grant No. 2020CFB757), Scientific Research Starting Foundation for Ph.D. of Huanggang Normal University (Grant No. 2020010).

Author information

Authors and Affiliations

Hubei Key Laboratory of Economic Forest Germplasm Improvement and Resources Comprehensive Utilization, Hubei Collaborative Innovation Center for the Characteristic Resources Exploitation of Dabie Mountains, Hubei Zhongke Research Institute of Industrial Technology, College of Biology and Agricultural Resources, Huanggang Normal University, Huanggang City, Hubei, 438000, China
Yue Fu, Xiangliang Fang, Yunli Xiao, Bin Mao, Zigang Xu & Mi Shen
College of Life Sciences, Nankai University, Tianjin, 300071, China
Xinhua Wang

Authors

Yue Fu
View author publications
You can also search for this author in PubMed Google Scholar
Xiangliang Fang
View author publications
You can also search for this author in PubMed Google Scholar
Yunli Xiao
View author publications
You can also search for this author in PubMed Google Scholar
Bin Mao
View author publications
You can also search for this author in PubMed Google Scholar
Zigang Xu
View author publications
You can also search for this author in PubMed Google Scholar
Mi Shen
View author publications
You can also search for this author in PubMed Google Scholar
Xinhua Wang
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

Yue Fu conceived the study and wrote the manuscript, Yue Fu and Xiangliang Fang collected the samples. Bin Mao and Zigang Xu performed bioinformatics analysis. Yunli Xiao and Mi Shen extracted the DNA. Xinhua Wang put forward suggestions for modification. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Yue Fu.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary tables

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Fu, Y., Fang, X., Xiao, Y. et al. Two chromosome-level genomes of Smittia aterrima and Smittia pratorum (Diptera, Chironomidae). Sci Data 11, 165 (2024). https://doi.org/10.1038/s41597-024-03010-y

Download citation

Received: 04 December 2023
Accepted: 26 January 2024
Published: 03 February 2024
DOI: https://doi.org/10.1038/s41597-024-03010-y
Springer Nature Limited

Two chromosome-level genomes of Smittia aterrima and Smittia pratorum (Diptera, Chironomidae)

Abstract

Similar content being viewed by others

The chromosome-level genome assembly of the giant dobsonfly Acanthacorydalis orientalis (McLachlan, 1899)

A high-quality chromosomal-level genome assembly of Greater Scaup (Aythya marila)

Chromosome-level genome assembly of the pygmy grasshopper Eucriotettix oculatus (Orthoptera: Tetrigoidea)

Background & Summary

Methods

Sample collection and sequencing

Genome size estimation and assembly

Genome annotation

Data Records

Technical Validation

Code availability

References

Acknowledgements

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Competing interests

Additional information

Supplementary information

Supplementary tables

Rights and permissions

About this article

Cite this article

Ecological data for tracking biological diversity and environmental change

Navigation

Two chromosome-level genomes of Smittia aterrima and Smittia pratorum (Diptera, Chironomidae)

Abstract

Similar content being viewed by others

The chromosome-level genome assembly of the giant dobsonfly Acanthacorydalis orientalis (McLachlan, 1899)

A high-quality chromosomal-level genome assembly of Greater Scaup (Aythya marila)

Chromosome-level genome assembly of the pygmy grasshopper Eucriotettix oculatus (Orthoptera: Tetrigoidea)

Background & Summary

Methods

Sample collection and sequencing

Genome size estimation and assembly

Genome annotation

Data Records

Technical Validation

Code availability

References

Acknowledgements

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Competing interests

Additional information

Supplementary information

Supplementary tables

Rights and permissions

About this article

Cite this article

Share this article

Search

Navigation