Background & Summary

The order Strepsiptera, commonly known as twisted-wing insects, are a group of enigmatic obligate parasites, with 10 extant families and approximately 600 described species1. High diversification rates associated with adaptive specialization to their hosts has allowed them to parasitize 7 orders and 34 families of insects2,3. More than 97% of Strepsiptera species are characterized by extreme sexual dimorphism, except for the ancestral family Mengenillidae where winged males and apterous females emerge as free-living adults3,4. In the remaining families, males develop inside their host and emerge as free-living winged adults, while adult neotenic females remain as obligate endoparasites inside their host5,6. The grub-like females lack a distinctive head, thorax, abdomen or body appendages and are viviparous, with some of the highest fecundity rates reported for any insect7.

A parasitic lifestyle coupled with unusual life history traits have been linked to unique genetic characteristics: exceptionally small genomes reported by flow cytometry, different levels of endoreduplication between males and females, and high rates of molecular evolution7,8. These unusual genetic and morphological features, combined with scarce genomic data available for the order, hindered efforts to place this order in the insect phylogeny for decades calling it the Strepsiptera problem9. However, recent studies used a whole genome shotgun sequencing approach of the basal species Mengenilla moldrzyki (Mengenillidae) to resolve the enigma and placed Strepsiptera confidently as the closest living relative to beetles (Coleoptera)10,11. Similarly, other studies now provide mitogenomic and transcriptomic data, in addition to phylogenomic comparisons based off shotgun whole genome sequencing for these unique parasites12,13,14,15,16,17.

Parasites often use metabolic resources from their hosts18. Therefore, selection for metabolic efficiency, such as costs of replicating DNA, can result in smaller genomes19,20. Similarly, organisms that undergo massive changes of their body plan when transitioning to an obligate endoparasitic life, often experience extensive gene losses as a tradeoff21,22. In Strepsiptera, genome size reduction or potential gene losses associated with host specialization and extreme sexual dimorphism combine compelling evolutionary scenarios to understand genome evolution7. In fact, the only cytogenetic data available for Strepsiptera indicates they have XY sex chromosomes and substantial variation in chromosome numbers within the same family (Xenidae). Specifically, these karyotypic studies report 2n = 16 for Xenos peckii23,24 and 2n = 8 and for Brasixenos. sp. from Brazil25. Xenidae is one of the most derived families in Strepsiptera and exhibits a high degree of host specialization. The ability to colonize new host lineages resulted in a constant net diversification rate that has produced between 77 and 96 potentially cryptic species26.

Throughout the world, many species of Xenos consistently infect social wasps of the genus Polistes (Fig. 1)27,28. Studies in two parallel Xenos-Polistes systems in Europe and North America reveal consistent behavioral and physiological manipulation strategies5,29,30. Parasitic effects include hindering host reproduction and manipulating wasps to abandon their colony to form aggregations that facilitate mating of adult parasites. Thus, members of Xenos evolve in a continuous arms race with their hosts to overcome the host immune system31, which also may influence genome evolution. As the genomic revolution advances, sequencing of high-quality Strepsiptera genomes is key to understanding the molecular mechanisms underpinning host manipulation, as well as the forces driving genome size evolution in this enigmatic group of insects.

Fig. 1
figure 1

Description of the study system. (a) A paper wasp, Polistes fuscatus, infected with two male parasites lodged in the abdominal cavity (right). An adult, free-living male Xenos peckii emerged from this host (left) (Scale Bar = 2 mm). (b) Detailed morphology of males and females. Left: A free-living adult male that emerges from the abdomen of the host. Center: close-up of wasp abdomen that shows the protruding cephalothorax of an obligate endoparasitic female between abdominal segments, and the empty pupal case where the male parasite emerged from. Females are dorsoventrally flattened and lack all adult insect features (i.e., eyes, antennae, mouth parts, legs, wings). Right: infected wasp host and size of parasites lodged in its abdomen for scale.

In this study, we present the first de novo genome assembly of the order Strepsiptera – A male Xenos peckii (Strepsiptera: Xenidae)– using PacBio HiFi sequencing technology. After filtering out bacterial scaffolds, the haploid assembly size is 72.1 Mb, assembled in 227 contigs, a contig N50 of 7.3 Mb, a GC content of 23.4%, BUSCO score of 87.4%, and a repetitive element content of 38.41%. This is the most complete and contiguous genome assembly available for the order Strepsiptera up to date. More than 75% of the genome is contained in 8 major contigs (L75 = 8) and the longest contig is 8.76 Mb. The 8 major contigs reflect the number of chromosomes reported from karyotypic data (n = 8)23,25. Furthermore, we identified 3 contigs that exhibit sex chromosome type coverage patterns (half depth of coverage in males). Additionally, we assembled and annotated the mitochondrial genome, which is 16111 bp long and has 37 genes. The size and gene content of the mitochondrial genome is similar to that of published mitogenomes for X. vesparum found in Europe and its close relative, X. cf. moutoni from China13,14, recently named X. yangi15. To our knowledge, this twisted-wing parasite genome is among the 10 smallest insect genomes reported in the literature, and the smallest insect genome assembled using long PacBio HiFi reads32. Our high-quality genomic data provides a unique tool to improve our understanding of the complex genome evolution associated with extreme sexual dimorphism coupled with a parasitic lifestyle. Additionally, this genome will facilitate the exploration of the molecular mechanisms underlying host and twisted-wing parasite coevolution.

Methods

Specimen collection and sample preparation

During June of 2021, we used entomological nets to collect infected social wasps, Polistes fuscatus, at the Robert H. Treman State Park located in Ithaca, NY, USA. We used taxonomic studies with detailed morphological characteristics to confirm that the parasite was X. peckii23,33,34. Several studies also included information about voucher specimens, confirming that P. fuscatus is infected by X. peckii in Ithaca, NY23,33,34. Infection and stage of development of X. peckii was confirmed by the extrusions in the host abdomen, which are morphologically distinct for developing female (N = 10) and male parasites (N = 10)5. We transported the infected wasps to the University of Rochester in disposable and transparent deli cups that are pre-punched to increase air flow. Each infected wasp was housed individually in a deli cup with a water vial and a sugar cube and kept under controlled conditions that included: temperature of 25 °C, 40% humidity, and full spectrum light for 12 hours. We inspected infected wasps daily during the extrusion period of 3–7 days and 9–17 days for male and female parasites, respectively. After emerging from a host, male parasites were frozen with dry ice and stored at −80 °C. A single female was dissected out of its host and preserved the same way. A single male X. peckii was preserved for an ultra-low DNA input sample preparation and subsequent long-read sequencing, while an additional male and the female were preserved for short-read sequencing. Finally, we froze one more single male in liquid nitrogen for subsequent extraction of RNA, sequencing, and genome annotation.

DNA and RNA extraction, library preparation, sequencing and pre-assembly read processing

Library preparation and sequencing for PacBio HiFi data was performed at the DNA sequencing and Genotyping Center of the Delaware Biotechnology Institute at University of Delaware. Total genomic DNA was isolated using the Qiagen MagAttract kit (QIAGEN, shanghai, China) following the manufacturer’s instructions. One SMRTbell library of circular consensus sequencing (CCS) was constructed according to the Low DNA input PacBio protocol with additional bead cleaning. The library was then sequenced on one PacBio Sequel II SMRT cell platform in CCS mode, which calls consensus reads from subreads that are generated after multiple passes of the enzyme around the circularized template (Pacific Biosciences). Sequencing yielded 8.6 Gb of data and ~0.9 million HiFi (high-fidelity) reads from the male X. peckii sample (Table 1).

Table 1 General statistics of raw sequencing reads used for X. peckii genome assembly and genome size estimation.

We extracted Genomic DNA from the additional male and the female X. peckii using a DNeasy Blood and Tissue kit (QIAGEN, Valencia, CA, USA) for library preparation and short read whole genome Illumina sequencing. RNA from a whole male was extracted with the RNeasy Mini Kit, implementing a DNase digestion step with the RNase-Free DNase Set (QIAGEN, Valencia, CA, USA). DNA and RNA samples were received and validated via standard quality control procedures at Novogene Co., Sacramento, CA. To construct the genomic DNA library, DNA was sheared into short fragments which were end repaired, A-tailed and ligated with Illumina adapter. Fragments with adapters were PCR amplified, size selected and purified. Each library was checked with Qubit and real time PCR for quantification, and then pooled for sequencing on a NovaSeq PE 150 Illumina platform (Novogene Co. Ltd. Sacramento, CA). Sequencing results generated 7.5 Gb and 6.0 Gb of data for the male and female samples, respectively (Table 1).

After sequencing, we assessed read quality using FastQC35. Then, we filtered reads for quality and length, and trimmed adapter sequences and PolyG tails using fastp36 with the options -l 40. Then we re-assessed the quality of the processed reads, which we used to validate the genome size, heterozygosity and repetitive content estimations and to identify sex chromosomes based on coverage patterns. The transcriptome data (total RNAseq) was also sequenced with the NovaSeq PE 150 platform, using an mRNA library preparation with standard poly A enrichment, which generated 6 GB of raw data to implement in the genome annotation.

Genome assembly and size estimation

Prior to assembly, estimating features like genome size, rate of heterozygosity, and repetitive element content proves useful to inform the parameter values that should be used in subsequent steps. We used Jellyfish (v2.3.0)37 to count and compute a histogram of k-mer frequencies from the raw PacBio HiFi reads using the count (-C -m 21) and histo (–h 1,000,000) modules (Table 2).

Table 2 Software and version used for analyses in this study with corresponding parameters, if different from the default options.

Then we used the Jellyfish histogram output to run the online web tool of GenomeScope2 (http://qb.cshl.edu/genomescope/genomescope2.0) with the following parameters: K-mer length = 21, ploidy = 2 and max kmer coverage = 1,000,00038. The model suggests that X. peckii has an estimated genome size of 63,567,278 bp with 1.19% heterozygosity levels, and 36.4% of the genome composed of repetitive elements (Fig. 2). The X. peckii genome we report here is one of the smallest insect genomes documented up to date7 (Table 3; See technical validation below).

Fig. 2
figure 2

Genome size and content estimation using GenomeScope 2.0. Genomescope2 k-mer (21) distribution from the adapter trimmed PacBio HiFi reads. K-mer depth is plotted against k-mer frequency for a given depth. The plot displays estimation of genome size (len), percentage of the genome that is not in repetitive elements (uniq), homozygous rate (aa), heterozygous rate (ab) mean k-mer coverage for heterozygous bases (kcov), read error rate (err), average rate of read duplications (dup), k-mer size used on the run (k) and ploidy (p). (a) Expected distribution according to the model. (b) Obtained distribution according to the model. The four peaks correspond to the mean coverage levels of the unique heterozygous, unique homozygous, repetitive heterozygous and repetitive homozygous sequences, respectively.

Table 3 GenomeScope estimation of genome structure using reads from different sequencing technologies.

After a preliminary survey of the genome structure, we ran four different assembly software optimized for HiFi reads (i.e., Hifiasm39, Improved Phased Assembler IPA40, Hicanu41, Flye42) and computed BUSCO scores to gauge differences in assembly metrics. Hifiasm produced the highest quality genome (see technical validation below; Table 4; Fig. 3).

Table 4 Assembly statistics using four different assembly software.
Fig. 3
figure 3

Assembly statistics with different software. (a) QUAST comparison of assembly contig length produced by different software. (b) BUSCO scores produced by each software. Assembly with Hifiasm shows the highest completeness score with lowest duplication rates. Each assembler indicates that approximately 10% of the genes in the BUSCO database are missing from the genome.

We assembled the HiFi reads without additional data into a draft haplotype-resolved genome assembly using Hifiasm v0.16.1-R34139. We then used the default parameters of Hifiasm to generate a primary and alternate assembly graph after aggressive purging of haplotypic duplications (-primary, -l 3) and three rounds of error correction39. The primary assembly produced by Hifiasm was 73,541,611 bp long, assembled in 229 contigs with an N50 of 7.3 Mb and a GC content of 23.6%. Both the primary (73 Mb) and alternate (64 Mb) assemblies produced by Hifiasm were similar to the estimated genome size produced by GenomeScope2.

However, the primary assembly was slightly larger than the GenomeScope2 estimations (Table 3; Fig. 2). High heterozygosity can result in spurious duplications that increase the genome size because two alleles from the same loci are included in the primary assembly. We used Jupiter Plot (−n = 50000, ng = 75)43 to produce a synteny plot and compare completeness between the primary and alternate assemblies produced by Hifiasm. The primary assembly produced by Hifiasm is more contiguous and the alternate assembly doesn’t include some of the largest contigs in the primary assembly (potential sex-chromosomes based on coverage patterns; Supplementary Figure 1a). Thus, the small discrepancies (<10 Mb) between the GenomeScope genome size estimations and the final assembly size may be due to assembler limitations when sorting or deduplicating repetitive regions.

Genome quality assessment

We used Blobtools2 from the BlobToolKit suite44 to screen for contamination. First, we performed a BLASTn45 search of our assembly against the general RefSeq blast -nt database using the following parameters: -outfmt ‘6 qseqid staxids bitscore std’ -max_target_seqs 1 -num_threads 12 -evalue 1e-6. Then we used the function blobtools –add to create a BlobDir database that included the blast output (hits file), the read coverage (bam file) from mapping the raw reads back to the assembly, and the BUSCO scores (Busco summary file; see below). We implemented the BlobToolKit v.1.1.144 online Viewer (blobtools host ‘pwd’) to create a Blobplot with the hits of our bacterial contamination scan, and their respective coverage (Supplementary Figure 2). We removed two contigs that matched the phylum Proteobacteria. Specifically, one whole contig was the genome of Wolbachia pipitensis (see technical validation below; Supplementary Figure 2). Then, we used QUAST v346 to assess the metrics of the curated assembly. The final assembly size was 72,105,243 Mb, assembled in 227 contigs with a GC content of 23.4%. We used the BlobtoolKit v.1.1.144 to create a SnailPlot to visualize our assembly statistics (Table 5; Fig. 4). This Whole Genome PacBio HiFi assembly project has been deposited at DDBJ/ENA/GenBank under the accession JAWUEG00000000063.

Table 5 Statistics produced by QUAST for final assembly using Hifiasm after removing bacterial contamination.
Fig. 4
figure 4

Snail plot assembly visualization using BlobTools 2.0. The contiguity and completeness of the X. peckii genome assembly after contamination screening is plotted as a circle that represents the full length of the assembly (~72.1 Mb), assembled in 227 contigs. The N50 (7.38 Mb) is highlighted in dark orange and the N90 (470k) in light orange. The longest contig was 8.76 Mb (highlighted in red). The assembly has a uniform GC content of 23.4% and few contigs <100 Kb in length. The BUSCO scores are shown in the top right corner in green.

We evaluated assembly quality and completeness using Benchmarking Universal Single-copy Orthologs (BUSCO) v.5.2.247 with the insecta_odb10 database. From a total of 1367 BUSCO groups searched, our assembly has 87.4% of genes complete and present in single copy, 0.7% duplicated genes, 1.9% fragmented genes, and 10.4% of the genes missing (Fig. 3). Additionally, we evaluated our assembly against the 2124 BUSCO groups from the endopterygota_odb10 database. This assessment shows 74.8% of complete and single copy genes, 1.1% duplicated genes, 6.5% fragmented genes, and 17.6% missing genes. The underlying cause behind the high proportion of missing genes from both datasets will be clarified once more genomic data for the order becomes available.

Mitochondrial genome assembly

We assembled the mitochondrial (mtDNA) genome of X. peckii from the raw PacBio HiFi reads with the MitoHiFi pipeline48,49, which uses a reference-guided method to perform the assembly. First, the pipeline implemented the software MitoFinder49 to scan all the mtDNA assemblies available on NCBI and downloads the highest quality reference mitogenome for the most closely related taxa. PacBio HiFi reads were mapped to the reference mitogenome using Minimap2 v.2.1750. Then, all the mapping reads were assembled de novo with Hifiasm39 into a mitogenome. A mitochondrial genome is available for X. vesparum (partial genome: 14,519 bp NCBI Accession number DQ364229). In addition, two independent mitogenomes were found for X. yangi, previously X. cf. moutoni (partial genome: 16,717 bp NCBI Accession number MW222190 and partial genome: 15,324 bp NCBI Accession number OK329871)13,14,15. Following the MitoFinder output, we used the mitochondrial genome provided under X. yangi, as the starting reference sequence, because it is the most complete sequence available15. Then, we used MitoHiFi (-p 90 -o 5) to assemble the mitogenome. A total of 30,623 HiFi reads mapped to the reference and were used for de novo assembly with Hifiasm. The final mitogenome size assembled for X. peckii was 16,111 bp (NCBI Accesion JAWUEG000000000)63. To confirm the accuracy of the mitogenome, we used NOVOPlasty51 to assemble an independent mitogenome de novo. We used the Illumina 150 bp short reads from the male X. peckii (Table 1) and the COX1 X. peckii gene sequence available on NCBI (Accesion JN082808)8 for the starting reference, as input for NOVOPlasty. This de novo NOVOPlasty assembly resulted in a mitogenome of 15,946 bp, a similar size and syntenic to the one assembled with MitoHiFi (Supplementary Figure 3).

Subsequently, we used the software MITOS252 to annotate the mitochondrial genome. The sequence spans the full region between rrnL and ND2, contains 37 genes and is circularized (Fig. 5a). After manual inspection and curation of the mitogenome, we found that MITOS2 was unable to annotate the complete region of the 16S rRNA gene (i.e., rrnS). In Strepsiptera, a recent study highlights that the rrnS gene appears to have highly divergent regions in the 5’ section of the gene that is flanked by trnV, which hinders the annotation process17. Additionally, to assess the coverage along the sequence, we mapped back to the assembly all the reads that aligned to the original reference X. yangi mitogenome, as well as all the HiFi reads available. We found relatively homogeneous coverage along the sequence, except in the first 2 kb which is likely composed of repetitive elements (Fig. 5b,c). Finally, we used BLAST + v2.145 to search for matches of our mitochondrial assembly in the nuclear genome and filtered out contigs from the nuclear genome with a percentage of sequence identity >99% and of smaller size than the mtDNA sequence53.

Fig. 5
figure 5

Mitogenome annotation and coverage. (a) Annotation and protein prediction plot produced by MITOS2. The figure legend highlights the position and quality of the sequence by color-coding individual genes. (b) Coverage along the sequence, using only reads that mapped to the reference X. yangi mitogenome. (c) Coverage along the sequence implemented all HiFi reads available.

Mapping rate and coverage for X-linked contigs discovery

Cytogenetic data for the order Strepsiptera is only available for two species, and it indicates that they have heteromorphic X and Y chromosomes12,54. Specifically, the number of chromosomes of X. peckii was identified as 2n = 1623,25. To evaluate mapping rate and depth of coverage along the assembly we used Minimap2 v.2.1750 to map the raw long PacBio HiFi reads, as well as the short Illumina (150 bp) PE reads, back to the final assembly. The male X. peckii HiFi reads had an average mapping rate of 99.43% and 97.8% coverage, with a mean depth of coverage of 117X. For the Illumina PE reads the mapping rate was 94.95% for the male and 77.61% for the female, with a mean depth of coverage of 97.27X and 63.73X respectively. We used the CIRCOS55 tool to visualize the depth of coverage through the first 11 contigs that contain >90% of the genome. We identified three X-linked contigs (ctg02, ctg23 and ctg15), which presented a male to female coverage ratio of 0.5. Interestingly, ctg02 is also the largest contig in our assembly (Fig. 6).

Fig. 6
figure 6

X-linked contig discovery based on depth of coverage. Average coverage over 50 Kb windows with 20Kb step size in the 11 largest contigs, that contain >90% of the genome. Three individual samples were used: the first external coverage plot corresponds to the HiFi reads of the X. peckii male used for the genome assembly. The middle coverage plot corresponds to short 150 bp Illumina reads from another male, and the internal coverage plot corresponds to short 150 bp Illumina reads from a female. Potential X-linked contigs (red) have half the coverage in both male samples than the rest of the autosomes (blue). In contrast, the female sample has a uniform depth of coverage throughout the plot.

Repeat modeler

As this is the first high quality genome assembly of the order Strepsiptera and there are no repeat libraries available for the order, we identified and classified repetitive elements de novo using RepeatModeler v.2.0.13356. Then, we used RepeatMasker v.4.0.73457 to mask the genome assembly using the de novo library produced by RepeatModeler2. RepeatMasker masked 38.41% of the genome, a similar value to the repeat content suggested by GenomeScope2 before removing the bacterial contigs. Most of the repeats in the genome are unclassified (27.8%). LTR elements are the most abundant elements among the classified repeats (5.74%; Fig. 7; Table 6).

Fig. 7
figure 7

Repeat content classification using RepeatModeler2 and RepeatMasker. The proportion per category of repetitive element found across the genome. Categories include long interspersed elements LINEs (brown), LTR elements (yellow), DNA elements transposons (light green), simple repeats (purple), low complexity (orange) and unclassified repeats (pink).

Table 6 Repetitive elements modeled de novo with RepeatModeler2 (class and content).

Annotation

We annotated the X. peckii genome assembly using the BRAKER358 pipeline. First we used RepeatMasker v.4.0.73457 to soft mask the genome (–xsmall) using the repeat library we generated de novo with RepeatModeler256. Next we used the software STAR (Spliced Transcripts Alignment to a Reference) v2.7.159 to map RNA-seq reads (male; whole body) to the genome. Then we used the mapped reads (–bam) as evidence input for BRAKER358 to generate ab initio gene predictions which resulted in 8759 predicted genes. Additionally, we ran a homology-based gene prediction with the tool GeMoMa v1.960 which allows to incorporate RNA-seq evidence for splice site prediction. We used the annotated proteins from the published genome of Drosophila melanogaster61 and the beetle Tribolium castaneum62 as reference input, and incorporated the previously mapped RNA-seq reads with the following command:

java -Xms5G -Xmx170G -jar GeMoMa-1.9.jar CLI GeMoMaPipeline AnnotationFinalizer.r=NO o=true t=UROC_Xpeckii_1.1.fasta s=own i=Tcast a=./GCF_031307605.1/genomic.gff g=GCF_031307605.1_icTriCast1.1_genomic.fna

s=own i=Dmel a=dmel-all-r6.45.problem.free.subset.gff g=dmel-all-chromosome-r6.56.fasta r=MAPPED ERE.s=FR_FIRST_STRAND ERE.m=STAR_RNA_mapped.bam threads=16 outdir=TriboliumDrosophila. The combination of homology-based and ab-initio gene predictions by GeMoMa resulted in 12,860 genes (GFF file available in Figshare64).

Data Records

De novo whole genome PacBio HiFi assembly and mitochondrial genome assembly were deposited at DDBJ/ENA/GenBank under the accession JAWUEG00000000063. Mitochondrial genome annotation is available as a feature table as well as GenBank format in Figshare64.The raw genomic and transcriptomic Illumina PE sequencing reads were deposited at the NCBI Sequence Read Archive under the SRP accession number SRP48888265. Sample specific accession numbers are SRX23571581 (Male WGS), SRX23571582 (females WGS) and SRX23571583 (Male RNA seq). The GFF files with ab initio gene predictions from BRAKER3 and the homology-based gene predictions from GeMoMa are available in Figshare64.

Technical Validation

GenomeScope2 size estimation validation and comparative analysis with other Xenos species

We used the short 150 bp Illumina PE reads from both male and female X. peckii samples to verify the accuracy of our GenomeScope2 estimations from the raw PacBio HiFi reads. Next, we computed the histogram of k-mer frequencies with Jellyfish v2.3.037 (-C -m 21; histo –h 1,000,000), and ran the output through the online web tool of GenomeScope2 (K-mer length = 21, ploidy = 2 and max K-mer coverage = 1,000,000)38. The estimated genome size using male and female high-coverage short Illumina reads suggests that the genome size ranges between 67,923,714 bp for the male and 75,093,981 bp for the female (Table 3; Supplementary Figure 4). Moreover, the estimated repetitive content of the genome is around 43.7% and 41.4% for the male and female reads, respectively.

To further validate our results and place them in a comparative framework, we performed the same GenomeScope2 estimations with the only two Xenos species for which whole genome shotgun sequence data is available: X. vesparum and X. cf. moutoni (150 bp Illumina PE reads; NCBI Accession number SAMN03323551 and PRJNA681068 respectively)12,14. The genome size of X. vesparum was previously calculated using flow cytometry data and is reported to be ~ 133Mb7. Our GenomeScope estimations are consistent with this previously reported size for X. vesparum, with a slightly smaller genome size of ~ 120.8 Mb. Similar to our X. peckii assembly, the X. vesparum data showed small discrepancies (~10 Mb) between the GenomeScope2 estimations and the genome size calculated using flow cytometry. These results indicate that our estimations closely reflect the genome size of the 3 X. peckii samples that we analyzed. Our results reveal a significant amount of variation in genome size for Xenos, ranging from 58,068,699 bp in X. cf. moutoni to 120,869,754 bp in X. vesparum (Table 3). Importantly, the genome size estimations for our male X. peckii reads are ~10 Mb smaller than the female reads which is consistent with differentiated XY chromosomes. If the X chromosome has slightly higher repetitive content, then we would expect the females to have a larger genome size. However, the repetitive content estimated for both male and female short reads was similar (Table 3). Further studies will confirm if the difference in genome size estimations is due to sequencing or assembly artifacts or due to real biological differences in the genomes of males and females.

Genome assembly metrics with different software

We ran four different assembly software (i.e. Hifiasm39, Improved Phased Assembler IPA40, Hicanu41, Flye42) to assess which one produced the highest quality genome. We evaluated the quality for all assemblies using QUAST v.5.0.246 and BUSCO v.5.2.247 with the insecta_odb data base (Table 4; Fig. 3). Hifiasm produced the overall most contiguous and complete genome assembly, with the lowest rates of duplications reported by BUSCO. Both Flye and Hicanu were inefficient in resolving the assembly, which resulted in large and fragmented assemblies with high duplication according to the BUSCO scores (Fig. 3). Moreover, we used a Jupiter plot43 (−n = 50000, ng = 75) to compare the contiguity and completeness of the assemblies produced by different software. We confirmed that the assembly produced by Hifiasm is more contiguous and complete than the one produced by IPA which was the second-best ranked assembly of our analysis (Supplementary Figure 1b).