Background & Summary

The Oriental chestnut gall wasp, Dryocosmus kuriphilus (Hymenoptera: Cynipidae), is native to China and naturally distributed in Shanxi, Hebei, Shandong, Hunan, Hubei, Anhui, Henan, Fujian, Zhejiang, Jiangxi and Jiangsu provinces1. It is one of a long list of invasive alien hymenopterans that have established themselves outside their native range. Its capacity for rapid expansion, coupled with the associated ecological and economic damage, makes it one of the most important pests of chestnuts (Castanea spp.) worldwide. As an invasive species, D. kuriphilus has spread to Japan, South Korea, Nepal, Italy, France, Slovakia, Hungary, and the United States2,3,4. Chestnut gall wasps can attack almost all species of the genus Castanea, including Castanea mollissima, Castanea henryi, Castanea seguinii, Castanea crenata, Castanea sativa and Castanea dentata, causing serious damage to chestnut production5,6,7. It has become an important horticultural pest and has been listed as a quarantine pest by the European and Mediterranean Plant Protection Organization (EPPO). The larva lives inside a completely sealed gall and efficiently absorbs host nutrients, which makes the pest difficult to control8. D. kuriphilus is a parthenogenetic insect, one of 48 known species of the genus Dryocosmus, and has a strong reproductive capacity, with a single adult female producing about 300 eggs. Gall formation and parthenogenesis are considered the key reasons for its rapid spread and population growth9,10. D. kuriphilus can reduce chestnut production by decreasing the formation of female flowers11, and yield can be reduced by up to 80%12. Infestations also indirectly reduce leaf area, leading to earlier leaf mortality and abscission, lower leaf biomass, and a reduced ability to produce winter buds3,13. Massive attacks and lasting damage gradually reduce tree vitality.

In this study, a chromosome-level assembly of the D. kuriphilus genome was generated using a combination of PacBio long reads, Illumina short reads, and chromosome conformation capture (Hi-C) sequencing. Gene models were predicted with EVidenceModeler, embedded in a pipeline that integrates evidence from ab initio predictions, homology-based searches, and full-length transcriptome and RNA-seq alignments. This high-quality reference genome provides not only a valuable resource for understanding the genetics, ecology, and evolution of D. kuriphilus, but also theoretical guidance for explaining the evolutionary mechanisms underlying its environmental adaptation and invasion.

Methods

Sampling, sequencing, and genome size estimation

Galls of Dryocosmus kuriphilus were collected from chestnut trees in a local paddy field in Changsha, Hunan province, China, from May to June 2019. The collected galls were placed in a breeding cage (length × width × height = 30 cm × 20 cm × 30 cm) at 25 °C, and the emerging adults were frozen in liquid nitrogen and stored at −80 °C for subsequent sequencing. Genomic DNA was extracted from 20 D. kuriphilus adults to construct polymerase chain reaction (PCR)-free Illumina libraries with 300–500 bp inserts and a PacBio library with 20 kb inserts, which were sequenced on the Illumina HiSeq 2500 and PacBio Sequel platforms, respectively. A total of 227.2 Gb of Illumina clean reads and 336.5 Gb of PacBio long reads were generated in this study (Table 1). Quality assessment showed that the paired-end Illumina data are of high quality, with over 91.5% of sequences having a single-base error rate below 0.001 (i.e., a quality score above 30; Fig. 1A,B). The PacBio reads had an average length of 19.5 Kb, with an N50 length of 28.5 Kb and an N90 length of 10.8 Kb.
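The read statistics quoted above rest on two standard definitions: the Phred relation between quality score and error probability, and the N50/N90 length statistics. The following is a minimal sketch of both, not the authors' pipeline; the function names are illustrative assumptions.

```python
import math


def phred_to_error(q: float) -> float:
    """Convert a Phred quality score to a per-base error probability."""
    return 10 ** (-q / 10)


def error_to_phred(p: float) -> float:
    """Convert a per-base error probability to a Phred quality score."""
    return -10 * math.log10(p)


def nx(lengths, x=50):
    """Return the Nx length: the largest L such that sequences of
    length >= L together contain at least x% of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 100 >= total * x:
            return length
    raise ValueError("lengths must not be empty")
```

Under these definitions, a quality score of 30 corresponds to an error probability of 0.001, and `nx(read_lengths, 50)` and `nx(read_lengths, 90)` give the N50 and N90 of a list of read lengths.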

Table 1 Library sequencing data and methods used in this study to assemble the D. kuriphilus genome.
Fig. 1 Illumina paired-end read quality scores. (A) Quality scores of forward-strand sequencing reads. (B) Quality scores of reverse-strand sequencing reads.

The genome size of D. kuriphilus was estimated using a k-mer-based method. The k-mer distribution of the Illumina reads was counted using jellyfish v2.3.014 (k-mer = 21, parameters: count -m 21 -t 10 -s 1G). The genome size and heterozygosity rate were estimated to be ~2752.38 Mb and 0.43%, respectively, using the online version of GenomeScope (http://qb.cshl.edu/genomescope/) with the k-mer count distribution file (Fig. 2).
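The idea behind k-mer-based size estimation can be illustrated with a simplified calculation: the total number of k-mer occurrences divided by the depth of the main histogram peak. This is only a sketch. GenomeScope fits a statistical mixture model rather than using this shortcut, and the function names and the `min_depth` error cutoff below are assumptions made for illustration. The input is assumed to be two-column "depth count" lines, as written by `jellyfish histo`.

```python
def parse_histo(lines):
    """Parse 'depth count' lines into a {depth: count} dictionary."""
    histogram = {}
    for line in lines:
        if line.strip():
            depth, count = line.split()
            histogram[int(depth)] = int(count)
    return histogram


def estimate_genome_size(histogram, min_depth=5):
    """Estimate genome size as total k-mer occurrences / peak depth.

    k-mers below min_depth are treated as sequencing errors and ignored
    (an assumed, illustrative cutoff).
    """
    usable = {d: c for d, c in histogram.items() if d >= min_depth}
    if not usable:
        raise ValueError("no k-mers at or above min_depth")
    peak_depth = max(usable, key=usable.get)
    total_kmers = sum(d * c for d, c in usable.items())
    return total_kmers / peak_depth, peak_depth
```

The estimate is sensitive to the choice of error cutoff and to heterozygous or repeat-derived peaks, which is why model-based tools are generally preferred for reporting.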

Fig. 2 Distribution of k-mer (19-mer) frequency.

Hi-C library preparation and sequencing

Crosslinking was stopped by adding glycine with additional vacuum infiltration. The fixed tissue was then ground to powder and resuspended in nuclei isolation buffer to obtain a suspension of nuclei. The purified nuclei were digested with 100 units of DpnII and labeled by incubation with biotin-14-dATP. Biotin-14-dATP at non-ligated DNA ends was removed using the exonuclease activity of T4 DNA polymerase. The ligated DNA was sheared into 300–600 bp fragments, blunt-end repaired, and A-tailed, followed by purification through biotin-streptavidin-mediated pull-down. Finally, the Hi-C libraries were quantified and sequenced using the Illumina MGI-2000 platform. A total of 357.2 Gb of clean data was generated in this study (Table 1).

Transcriptome sequencing

Illumina transcriptome sequencing was performed for three developmental stages of D. kuriphilus: larva, pupa, and adult. RNA libraries were prepared using the TruSeq RNA Sample Prep Kit (Illumina, USA) according to the manufacturer’s instructions, and PE150 sequencing was conducted on an Illumina NovaSeq 6000 platform at Novogene Biotech Co., Ltd. (Beijing, China). A total of 35.5 Gb of clean data was generated (Table 1).

For PacBio transcriptome sequencing, 300 ng of total RNA, pooled in equal volumes from the three stages (larva, pupa, and adult), was reverse transcribed into cDNA and amplified using the NEBNext® Single Cell/Low Input cDNA Synthesis & Amplification Module and the Iso-Seq Express Oligo Kit. The cDNAs were purified with ProNex Beads and used to construct the library with the SMRTbell Express Template Prep Kit 2.0. The SMRTbell template was annealed to the sequencing primer, bound to polymerase, and sequenced on the PacBio Sequel II platform. A total of 22.2 Gb of full-length transcriptome data was generated.

Genome assembly

Wtdbg 215 (parameters: -t 32 -g 2.6g -x sq -l 4096 -L 10000) was used to assemble the D. kuriphilus genome. The draft assembly was polished with three rounds of Racon v1.4.3 (https://github.com/isovic/racon) using the PacBio subreads, followed by three rounds of Pilon v1.2316 (parameters: --fix all --changes) using the Illumina paired-end reads. The final draft genome had a total length of 2.17 Gb, comprising 9,372 contigs with a contig N50 of 0.8 Mb (Table 2).
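As a rough consistency check on the data volumes, sequencing depth can be approximated as yield divided by assembly length. The sketch below uses only figures quoted in the text (227.2 Gb Illumina, 336.5 Gb PacBio, 357.2 Gb Hi-C, and the 2.17 Gb draft assembly). The function name is an illustrative assumption, and these depths are derived here rather than reported by the authors.

```python
def fold_coverage(yield_gb: float, genome_gb: float) -> float:
    """Approximate sequencing depth as total yield / genome length."""
    if genome_gb <= 0:
        raise ValueError("genome_gb must be positive")
    return yield_gb / genome_gb


ASSEMBLY_GB = 2.17  # total length of the draft assembly, from the text

# Yields in Gb, as quoted in the text for each library type.
yields_gb = {"Illumina": 227.2, "PacBio": 336.5, "Hi-C": 357.2}

depths = {name: fold_coverage(gb, ASSEMBLY_GB) for name, gb in yields_gb.items()}

for name, depth in depths.items():
    print(f"{name}: ~{depth:.1f}x")
```

Using the k-mer-based size estimate (~2752.38 Mb) as the denominator instead of the assembly length would give correspondingly lower depths.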

Table 2 Global statistics for assembly of D. kuriphilus genome.

The Hi-C reads were used to anchor the contigs onto chromosomes through sorting, orientation, and ordering. The 357.2 Gb of Hi-C paired-end data were used to group the contigs into chromosomes with ALLHiC17. We then divided the assembled chromosomes into equally sized bins (500 Kb) and constructed an interaction heatmap based on the number of valid paired-end reads supporting interactions between each pair of bins. Visual correction of the assembly was completed using JuiceBox v.2.1.1018 based on the intensity of chromosome interactions (Fig. 3). The final chromosome-level genome had an N50 of 198.8 Mb and an N90 of 158.8 Mb (Table 2).
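The binning step behind the interaction heatmap can be sketched as follows. This is an illustrative reconstruction, not the code used in the study. It assumes that valid read pairs are already available as `(chrom1, pos1, chrom2, pos2)` tuples, and the function name is hypothetical.

```python
from collections import Counter

BIN_SIZE = 500_000  # 500 Kb bins, as described in the text


def contact_counts(pairs, bin_size=BIN_SIZE):
    """Count valid read pairs supporting each pair of genomic bins.

    pairs: iterable of (chrom1, pos1, chrom2, pos2) tuples.
    Returns a Counter keyed by an ordered pair of (chrom, bin_index),
    so that A-B and B-A contacts are counted together.
    """
    counts = Counter()
    for chrom1, pos1, chrom2, pos2 in pairs:
        a = (chrom1, pos1 // bin_size)
        b = (chrom2, pos2 // bin_size)
        key = (a, b) if a <= b else (b, a)
        counts[key] += 1
    return counts
```

Storing only one orientation of each bin pair keeps the matrix symmetric by construction; a heatmap can then be drawn by filling both halves from the same counts.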

Fig. 3 Landscape of the D. kuriphilus genome. (A) Distribution of genome features at the chromosome level in D. kuriphilus. The tracks, from outside to inside, represent chromosome length, GC content, gene density, DNA transposon density, LINE transposon density, LTR density, SINE transposon density and TRF density. (B) Hi-C interaction heat map of the chromosome-level genome of D. kuriphilus.

Genome annotation

A de novo repeat library for D. kuriphilus was constructed with RepeatModeler v. 1.0.4 (http://www.repeatmasker.org/RepeatModeler.html). Transposable elements (TEs) in the D. kuriphilus genome were then identified with RepeatMasker v4.0.6 (http://www.repeatmasker.org/) using both the Repbase library and the de novo library. A total of 1.8 Gb of repeat sequences, accounting for 79.7% of the D. kuriphilus genome, were identified in this study, comprising 62.3% TEs and 17.4% tandem repeats (Table 3). The TEs in the D. kuriphilus genome were masked before gene prediction.
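One way to sanity-check a repeat fraction such as the one reported above is to count masked bases in the masked assembly. The sketch below assumes soft-masking (repeats written in lowercase), which the text does not specify, and takes sequences as plain strings. The function name is illustrative.

```python
def masked_fraction(sequences):
    """Return the fraction of bases that are soft-masked (lowercase).

    sequences: iterable of sequence strings, e.g. one per chromosome.
    Assumes repeats are marked in lowercase; hard-masked genomes
    (repeats replaced by N) would need a different count.
    """
    total = 0
    masked = 0
    for seq in sequences:
        total += len(seq)
        masked += sum(1 for base in seq if base.islower())
    if total == 0:
        raise ValueError("no sequence data")
    return masked / total
```

For a real assembly the sequences would be streamed from the FASTA file rather than held in memory, but the counting logic is the same.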

Table 3 TE annotation of the D. kuriphilus genome.

Genes in the D. kuriphilus genome were predicted using a strategy that integrates ab initio prediction, homology searching, and transcriptome-based approaches. A total of 122,822 genes were predicted in the D. kuriphilus genome by Augustus (v3.5.3)19. For homology-based annotation, we queried the D. kuriphilus genome sequences against a database of non-redundant protein sequences from three species (Apis mellifera (GCA_003254395.2), Nasonia vitripennis (GCA_009193385.2) and Tribolium castaneum (GCA_000002335.3)) using genBlastA20 (parameters: -e 1e-2 -g T -f F -a 0.5 -d 100000 -r 10 -c 0.5 -s 0), followed by Genewise21 annotation. A total of 20,395, 29,337, and 24,050 genes were predicted from the Apis mellifera, Nasonia vitripennis, and Tribolium castaneum gene sets, respectively. For RNA-seq-based annotation, the Illumina paired-end and PacBio full-length transcript data were mapped to the assembled D. kuriphilus genome, followed by gene prediction using cufflinks v2.2.1e and PASA v2.3.322,23. The final gene set was generated by combining all predictions using the EVidenceModeler program (EVM-1.1.1)24. To maintain confidence in the predicted genes, we retained only gene models supported by at least one line of evidence from homologous proteins of closely related species, InterProScan domains, or RNA-seq data. Finally, a total of 24,086 protein-coding gene models were predicted in the D. kuriphilus genome (Table 4).
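The final filtering rule, which keeps a gene model if at least one type of evidence supports it, can be expressed as a simple predicate. The sketch below is illustrative only. The `GeneModel` class, its field names, and the gene identifiers in the usage example are hypothetical and do not reflect the data structures used in the study.

```python
from dataclasses import dataclass


@dataclass
class GeneModel:
    """Hypothetical record of the evidence supporting one gene model."""
    gene_id: str
    has_homolog: bool  # supported by a homologous protein
    has_domain: bool   # supported by an InterProScan domain
    has_rnaseq: bool   # supported by RNA-seq data


def keep_supported(models):
    """Keep gene models with at least one line of supporting evidence."""
    return [
        m for m in models
        if m.has_homolog or m.has_domain or m.has_rnaseq
    ]
```

For example, with three made-up models, only the one lacking all evidence is dropped:

```python
models = [
    GeneModel("g1", True, False, False),
    GeneModel("g2", False, False, False),
    GeneModel("g3", False, True, True),
]
kept_ids = [m.gene_id for m in keep_supported(models)]
```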

Table 4 The evidence of gene prediction in D. kuriphilus genome.

For functional annotation, we searched the predicted protein-coding genes against the non-redundant (NR) database using BLASTP v2.9.03325, and against the Pfam, Gene Ontology (GO), Kyoto Encyclopedia of Genes and Genomes (KEGG) and eggNOG databases. A total of 89.1% (21,460 of 24,086) of the protein-coding genes were annotated in this study (Table 5).

Table 5 Annotation statistics for the D. kuriphilus genome.

Data Records

The D. kuriphilus genome project has been deposited at NCBI under BioProject No. PRJNA109237826. Genomic PacBio sequencing data have been deposited in the Sequence Read Archive at NCBI under accession number SRR2846712727. Genomic Illumina sequencing data have been deposited in the Sequence Read Archive at NCBI under accession number SRR2864693428. Hi-C sequencing data have been deposited in the Sequence Read Archive at NCBI under accession number SRR2867963529. Illumina RNA-seq data have been deposited in the Sequence Read Archive at NCBI under accession numbers SRR28520759-SRR2852076630,31,32,33,34,35,36,37, and PacBio RNA-seq data have been deposited in the Sequence Read Archive at NCBI under accession number SRR2850881138. The final chromosome assembly has been deposited in GenBank at NCBI under accession number JBBWUJ00000000039. The gene set of D. kuriphilus is available in Figshare under the DOI https://doi.org/10.6084/m9.figshare.25800868.v140.

Technical Validation

Three methods were used to evaluate the completeness of the genome assembly. First, 98.1% of the core genes in the OrthoDB insecta_odb10 dataset were identified as complete by BUSCO v5.3.241 (Table 6). Second, we assessed the completeness of the D. kuriphilus genome with another evaluation tool, compleasm v0.2.542, using the same insecta_odb10 database. The results showed that 98.32% of the evaluated genes were identified as complete (single-copy: 95.39%; duplicated: 2.93%) (Table 6). We also evaluated the completeness of the predicted gene set, and 93.8% of the predicted genes were identified as complete. Third, we aligned the Illumina short reads and PacBio long reads to the D. kuriphilus reference genome using BWA-MEM version 0.7.1721 (https://github.com/lh3/bwa). The analysis showed that 98.65% of the short reads and 95.55% of the long reads were successfully mapped to the D. kuriphilus genome.
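In BUSCO-style summaries, the "complete" category is the sum of the single-copy and duplicated categories, which offers a quick internal check on the compleasm figures quoted above. The following sketch uses only percentages stated in the text, and the function name is an illustrative assumption.

```python
def complete_percentage(single_copy: float, duplicated: float) -> float:
    """Complete = single-copy + duplicated, rounded to two decimals."""
    return round(single_copy + duplicated, 2)


# compleasm percentages as quoted in the text
reported_complete = 98.32
derived_complete = complete_percentage(95.39, 2.93)

# True when the reported total matches the sum of its components
is_consistent = derived_complete == reported_complete
```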

Table 6 Completeness of the assembled genomes and sets of protein-coding genes evaluated by BUSCO and compleasm analysis.