Background & Summary

Diomorus aiolomorphi Kamijo (Hymenoptera: Torymidae) is a parasitic inquiline associated with the gall maker Aiolomorphus rhopaloides Walker (Hymenoptera: Eurytomidae). D. aiolomorphi and A. rhopaloides are of considerable economic importance and predominantly inhabit bamboo forests. Notably, these two species constitute approximately 90% of the insects within this group in such environments1.

The gall maker A. rhopaloides lays its eggs in the internode at the base of new branch buds, stimulating the paraplegma tissue in these areas. This process inhibits the growth of bamboo plants, reducing both the quantity and quality of bamboo shoots. Bamboo galls have been observed to be contagiously distributed across both the culms and branches in a bamboo stand2,3. This damage makes A. rhopaloides a significant obstacle to the effective management and economic value of bamboo forests, with notable impacts on both society and the environment4. It not only reduces bamboo yield, quality and market prices but also causes indirect losses such as control and restoration costs and ecological impacts2,3,4,5. Adults of the inquiline D. aiolomorphi oviposit on these young bamboo galls. Unlike typical phytophagous insects, D. aiolomorphi cannot create its own galls but instead feeds on the gall tissues induced by other gall makers4,5. Understanding the attack pattern of D. aiolomorphi on bamboo galls is crucial for assessing and managing the population density of A. rhopaloides1. Although D. aiolomorphi is commonly associated with gall makers and is economically significant, it has received relatively little scientific attention6. Consequently, there is a substantial gap in our understanding of the genetic makeup of D. aiolomorphi.

In this study, we assembled the chromosome-level genome of D. aiolomorphi, the first chromosome-level genome sequenced for the family Torymidae. The genome size is 1,084.56 Mb, with 1,083.41 Mb (99.89%) assigned to five pseudochromosomes. The scaffold N50 of the genome is 224.87 Mb, and the complete Benchmarking Universal Single-Copy Orthologs (BUSCO) score reached 97.3%. A total of 762.12 Mb of repetitive elements were identified, accounting for 70.27% of the total genome size. A total of 18,011 protein-coding genes were predicted, with functional annotations available for 17,829 of these genes. The high-quality genome assembly of D. aiolomorphi provides a valuable resource for understanding the genomic traits of Torymidae.

Methods

Sampling

Galls were sampled from bamboo branches at Fuyang, Hangzhou, China (30°03′ N, 119°57′ E) before gall maker emergence, and a total of 1,467 galls were collected. An inquiline is an organism that lives within or on the structure of another organism. The inquiline, D. aiolomorphi, emerged from galls 15–20 days later than the gall maker A. rhopaloides. Before sequencing, both morphological examination7 and COI barcode information confirmed the identification of the species as D. aiolomorphi. The specimens were deposited at the Institute of Insect Sciences, Zhejiang University (ZJUH_20231101). They were preserved in 100% ethanol prior to DNA extraction to maintain the integrity of the genetic material, and subsequently kept in the scientific specimen repository.

Library preparation and genomic DNA sequencing

Genomic DNA was prepared by the sodium dodecyl sulfate (SDS) method followed by purification with the QIAGEN® Genomic kit (Qiagen, Hilden, Germany) according to the manufacturer’s standard operating procedure for both long-read and short-read whole genome sequencing (https://www.qiagen.com/us/resources/resourcedetail?id=566f1cb1-4ffe-4225-a6de-6bd3261dc920&lang=en). RNA extraction was conducted with the TRIzol reagent (Vazyme, Nanjing, China) (https://bio.vazyme.com/product/730.html). The quality of the extracted RNA was assessed using an Agilent 2100 Bioanalyzer (Agilent Technologies, Santa Clara, CA, USA). The RNA Integrity Number (RIN) was determined for each sample, ensuring that only high-quality RNA (RIN > 7.0) was used for subsequent sequencing. RNA sequencing produced a total of 4.73 Gb of data, with a duplication rate of 66.07%. The Q20 (Quality scores > 20) bases totaled 3,607,313,837 (97.8713%), while the Q30 (Quality scores > 30) bases amounted to 3,455,480,380 (93.7518%). Long-read sequencing was performed on the Nanopore GridION X5/PromethION sequencer (Oxford Nanopore Technologies, UK) at Nextomics. Short-read and transcriptome sequencing were performed on the Illumina Novaseq/MGI-2000 platforms. Long-read sequencing generated 81.21 Gb of data in total, while short-read sequencing generated 28.37 Gb (Table 1).

Table 1 Library sequencing data and methods used to assemble the D. aiolomorphi genome.

Genome survey and assembly

K-mer analysis was performed using Illumina paired-end sequenced DNA reads. This analysis was conducted before genome assembly to estimate the genome size and the level of heterozygosity. Briefly, quality-filtered reads were subjected to a 21-mer frequency distribution analysis employing Jellyfish v2.2.108. For a read of length L, the number of k-mers produced is (L - 21 + 1). The genome size (G) is therefore estimated by the formula: G = Knumber / Kdepth, where Knumber represents the total number of k-mers produced and Kdepth represents the peak value of k-mer depth. Furthermore, the overall genomic properties were inferred by GenomeScope v1.09. The preliminary genome survey of D. aiolomorphi revealed a low level of heterozygosity (0.19%) within a substantial genome of 988.63 Mb. This estimated genome size was used to evaluate the integrity of the subsequent assembly (Fig. 1, Supplementary Table S1).
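The estimate G = Knumber / Kdepth can be illustrated with a minimal Python sketch. This is not the Jellyfish or GenomeScope implementation; the function names, the dictionary input format and the `min_depth` cutoff (used here to ignore low-depth k-mers that are likely sequencing errors) are assumptions made for illustration only.

```python
from typing import Dict, Tuple


def kmers_per_read(read_length: int, k: int = 21) -> int:
    """Number of k-mers produced by one read of length L: L - k + 1."""
    return max(read_length - k + 1, 0)


def estimate_genome_size(histogram: Dict[int, int], min_depth: int = 5) -> Tuple[int, float]:
    """Return (Kdepth, estimated genome size in bp) from a k-mer histogram.

    histogram maps k-mer depth -> number of distinct k-mers at that depth.
    Bins below min_depth are skipped (an assumption of this sketch).
    """
    usable = {d: c for d, c in histogram.items() if d >= min_depth}
    if not usable:
        raise ValueError("no histogram bins at or above min_depth")
    # Kdepth: depth of the main peak (the bin with the most distinct k-mers).
    k_depth = max(usable, key=usable.get)
    # Knumber: total k-mer occurrences, i.e. the sum of depth * count.
    k_number = sum(d * c for d, c in usable.items())
    return k_depth, k_number / k_depth
```

With a toy histogram such as `{1: 1000, 2: 500, 19: 10, 20: 100, 21: 10}`, the peak depth is 20 and the estimated size is 2400 / 20 = 120 bp. Real histograms would come from a k-mer counter's output.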

Fig. 1

The K-mer distribution of the D. aiolomorphi genome. len, genome haploid length; uniq, genome unique length; het, heterozygosity; kcov, genome coverage; err, read error rate; dup, duplicated sequence.

The primary assembly of the clean reads obtained from the Nanopore platform was conducted using nextDenovo v2.5.010, and subsequently corrected using Canu v2.1.111. Illumina paired-end sequenced DNA reads were then utilized to polish and enhance the genome assembly using nextPolish v1.4.012. To eliminate haplotigs and contig overlaps in the de novo assembly, purge_dups v1.2.5 (https://github.com/dfguan/purge_dups) was employed. Finally, the primary assembly yielded 147 scaffolds with 1,084.58 Mb in genome size, 18.13 Mb in contig N50 and 224.87 Mb in scaffold N50.
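For readers unfamiliar with the N50 statistic reported above, the following is a minimal sketch of how contig or scaffold N50 is conventionally computed from a list of sequence lengths. The function name is illustrative and is not taken from any of the assembly tools cited.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a set of sequence lengths.

    N50 is the length of the shortest sequence in the smallest set of
    longest sequences that together cover at least half of the total length.
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one length is required")
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Compare doubled running sum to avoid fractional halves.
        if running * 2 >= total:
            return length
    return ordered[-1]  # unreachable for non-empty input; kept for clarity
```

For example, `n50([2, 3, 4, 5, 6])` returns 5, because the two longest sequences (6 + 5 = 11) already cover at least half of the total length of 20.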

Chromosome Hi-C assembly

The high-throughput chromosome conformation capture (Hi-C) method13 was utilized to anchor and accurately position scaffolds onto chromosomes. Genomic DNA was extracted from the thorax of an individual D. aiolomorphi for the Hi-C library, which was sequenced on the Illumina Novaseq/MGI-2000 platform. Sequencing yielded 110.44 Gb of raw data (Table 1), from which high-quality clean reads were obtained. All subsequent analyses were applied to these clean reads. The clean Hi-C paired-end reads were initially mapped to the primary assembly using Bowtie2 v2.3.214. Then, HiC-Pro v2.8.115 was utilized to identify valid alignments while filtering out multiple hits and singleton alignments. Finally, Lachesis16 was employed to cluster, order and orient the scaffolds. Following the Lachesis analysis, 1,083.41 Mb of sequence was anchored to five pseudochromosomes, amounting to 99.89% of the final assembly (Fig. 2, Table 2).

Fig. 2

Overview of the genomic features of the D. aiolomorphi genome. (a) Genome-wide all-by-all Hi-C interaction map identifying five pseudochromosome linkage groups of the D. aiolomorphi genome; (b) Genomic features of the D. aiolomorphi genome. Tracks from outside to inside (a-e) are as follows: pseudochromosomes, GC content, repeat density, gene density and collinearity between the pseudochromosomes.

Table 2 Statistics of final Hi-C scaffolding genome of D. aiolomorphi.

Assessment of the genome assembly

To assess the completeness and accuracy of the final assembly of the D. aiolomorphi genome, Benchmarking Universal Single-Copy Orthologs (BUSCO) v5.2.217 was run with the insecta_odb10 and hymenoptera_odb10 databases. The assessments yielded high BUSCO scores of 97.3% and 91.1%, respectively (Fig. 3, Supplementary Tables S2 and S3). Additionally, to check the assembly for contamination, the five pseudochromosomes from the final assembly were aligned to the Nt library using BLAST v2.5.018. Of the five pseudochromosomes, three (60%) showed similarity to Nasonia vitripennis, one (20%) to Eretmocerus sp. and one (20%) to Torymus sp. These results suggest that the pseudochromosome sequences are free from sequences of non-target organisms, contaminants, or symbionts present in the DNA library (Supplementary Table S4).
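The share of pseudochromosomes assigned to each best-matching taxon can be tallied as in the sketch below. The pseudochromosome labels and the pairing of each label with a taxon are hypothetical placeholders; only the counts per taxon (three, one and one) come from the text, and the function is not part of BLAST.

```python
from collections import Counter
from typing import Dict


def top_hit_shares(top_hits: Dict[str, str]) -> Dict[str, float]:
    """Return the percentage of sequences whose best hit is each taxon.

    top_hits maps a sequence name to the taxon of its best hit.
    """
    if not top_hits:
        raise ValueError("no hits supplied")
    counts = Counter(top_hits.values())
    total = len(top_hits)
    return {taxon: 100.0 * n / total for taxon, n in counts.items()}


# Placeholder labels: which pseudochromosome matched which taxon is assumed.
example_hits = {
    "chr1": "Nasonia vitripennis",
    "chr2": "Nasonia vitripennis",
    "chr3": "Nasonia vitripennis",
    "chr4": "Eretmocerus sp.",
    "chr5": "Torymus sp.",
}
shares = top_hit_shares(example_hits)
```

All three taxa are parasitoid wasps within Chalcidoidea, which is why such a tally is read as an absence of non-target sequence rather than as contamination.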

Fig. 3

The BUSCO summary of the D. aiolomorphi genome. The x axis represents the percentage of BUSCOs and the y axis represents BUSCO datasets.

Repetitive element annotation

In the D. aiolomorphi genome, transposable elements (TEs) were identified using the Extensive de novo TE Annotator (EDTA) v1.9.619, and tandem repeats were identified using Tandem Repeats Finder (TRF) v4.0920. A de novo repeat database was then generated using RepeatModeler v2.0.221. The known repeats in the Dfam database22 were combined with the results of TE detection and the de novo repeat database, creating a reference library that was clustered using Cd-hit v4.8.123 to eliminate redundant sequences. After combining and clustering, comprehensive repeat and TE detection was conducted using RepeatMasker v4.1.2 (https://www.repeatmasker.org/). The genome was found to contain a total of 762.12 Mb of repetitive sequences, accounting for 70.27% of the genome. Long Terminal Repeat (LTR) elements and DNA transposons were the most abundant types of repeats, representing 24.40% and 22.60% of the genome, respectively (Table 3).

Table 3 Repeat elements statistics in the D. aiolomorphi genome.

Protein-coding genes annotation

Transcriptome sequencing, homologous gene search and de novo prediction were employed to infer the protein-coding genes (PCGs) in the D. aiolomorphi genome, and the results were then integrated to generate a final gene set. Initially, transcriptome reads were aligned using Hisat2 v2.2.124 and assembled with StringTie v2.1.725. Meanwhile, Trinity v2.8.526 was utilized for de novo assembly of transcriptome reads. The transcriptome assembly was subsequently mapped to the genome for gene structure prediction using PASA v2.3.327. For the identification of homologous gene sets, sequences from various insects, manually annotated in the Universal Protein Resource database (UniProt, https://www.uniprot.org/) and National Center for Biotechnology Information (NCBI, https://www.ncbi.nlm.nih.gov/), were aligned to the D. aiolomorphi genome using Exonerate v2.4.028 and GeMoMa v1.7.129. De novo gene prediction involved three separate programs: Augustus v3.3.330, SNAP v2.54.331 and GeneMark-ETP v4.6532. A non-redundant consensus of gene structures was then generated by combining all results using EVidenceModeler v1.1.133. To annotate gene functions, the identified PCGs were aligned to various databases, including Nt, Nr, Swiss-Prot and TrEMBL, employing Diamond v2.0.534 with an e-value threshold of 1e-5. Protein classification and domain search were performed using eggNOG-mapper v2.1.435 and InterProScan v5.8.036. Finally, a total of 18,011 protein-coding genes were predicted, with 17,829 genes (98.99%) functionally annotated (Table 4).
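The e-value cutoff used for functional annotation can be sketched as a filter over tab-separated alignment hits. The sketch assumes the common 12-column BLAST-style tabular layout (query, subject, identity, length, mismatches, gap opens, query start and end, subject start and end, e-value, bit score); the exact output format used by the authors is not stated, and keeping the single best hit per gene by bit score, as well as treating the threshold as inclusive, are illustrative choices.

```python
import csv
import io
from typing import Dict, Tuple


def best_hits(tabular_text: str, max_evalue: float = 1e-5) -> Dict[str, Tuple[str, float, float]]:
    """Return the best hit per query from 12-column tabular alignment output.

    Hits with an e-value above max_evalue are discarded; among the rest,
    the hit with the highest bit score is kept for each query.
    """
    best: Dict[str, Tuple[str, float, float]] = {}
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if len(row) < 12:
            continue  # skip blank or malformed lines
        query, subject = row[0], row[1]
        evalue, bitscore = float(row[10]), float(row[11])
        if evalue > max_evalue:
            continue
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best
```

A gene with at least one retained hit in any of the searched databases would then count as functionally annotated, which is how a proportion such as 17,829 of 18,011 genes is typically derived.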

Table 4 Summary of the functional gene annotation of the D. aiolomorphi genome.

Non-coding RNA annotation

To identify noncoding RNAs, BAsic Rapid Ribosomal RNA Predictor (BARRNAP) v0.9 and tRNAscan-SE v2.0.537 were used to predict rRNAs and tRNAs, respectively. Infernal v1.1.238 was used to identify the remaining noncoding RNAs based on alignment with the Rfam library39. Finally, 539 noncoding RNAs (ncRNAs) were predicted, including 57 micro-RNAs (miRNAs), 104 ribosomal RNAs (rRNAs), 21 small nuclear RNAs (snRNAs), 15 small nucleolar RNAs (snoRNAs), and 344 transfer RNAs (tRNAs) (Supplementary Table S5).

Data Records

The MGI, ONT, RNA-seq and Hi-C sequencing data used for the genome assembly were deposited in the NCBI Sequence Read Archive (SRA) database with accession numbers SRR2688253040, SRR2688252941, SRR2688253142 and SRR2688252843, respectively, under the BioProject accession number PRJNA1036143. The chromosome assembly was deposited at GenBank with accession number JAXKQO00000000044. Genome annotation information was deposited in the Figshare database45.

Technical Validation

To ensure the reliability and integrity of the genomic data, we applied rigorous preprocessing to all datasets, including Illumina paired-end DNA short reads, Nanopore DNA long reads, Illumina paired-end RNA reads and Illumina paired-end Hi-C reads (Illumina sequencing system protocol: https://support.illumina.com/content/dam/illumina-support/documents/documentation/system_documentation/novaseq/1000000019358_17_novaseq-6000-system-guide.pdf; Nanopore sequencing system protocol: https://a.storyblok.com/f/196663/x/a2ee9a9945/j2586-promethion-24-combined-qsg_170x250mm_rev2-final.pdf). This preprocessing was carried out using fastp v.0.21.646, a widely used tool in genomic studies. Its primary objective was to remove low-quality sequences (Quality scores < 20), adapter sequences, reads containing poly-N and sequences shorter than 30 bp. After applying these filtering criteria, we obtained clean reads, which were stored in fastq/fasta format.
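The filtering criteria above can be approximated with the simplified sketch below. It is not fastp's algorithm: fastp evaluates the proportion of low-quality bases per read and trims adapters, whereas this sketch assumes Phred+33 quality encoding, uses the mean read quality as a stand-in, drops any read containing N, and omits adapter removal. All function names are illustrative.

```python
from typing import List


def mean_quality(quality_line: str, offset: int = 33) -> float:
    """Mean Phred score of a read, assuming Phred+33 encoding."""
    return sum(ord(ch) - offset for ch in quality_line) / len(quality_line)


def passes_filters(sequence: str, quality_line: str,
                   min_length: int = 30, min_mean_quality: float = 20.0) -> bool:
    """Apply the length, N-content and quality criteria to one read."""
    if len(sequence) < min_length:
        return False  # shorter than 30 bp
    if "N" in sequence.upper():
        return False  # contains uncalled bases
    return mean_quality(quality_line) >= min_mean_quality


def filter_fastq(lines: List[str]) -> List[str]:
    """Return the headers of reads that pass, from 4-line FASTQ records."""
    kept = []
    for i in range(0, len(lines) - 3, 4):
        header, sequence, _plus, quality = lines[i:i + 4]
        if passes_filters(sequence, quality):
            kept.append(header)
    return kept
```

In Phred+33 encoding the character `I` corresponds to a score of 40 and `#` to a score of 2, so a 40 bp read with quality string `"I" * 40` passes while the same read with `"#" * 40` is discarded.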