First genome assembly of the order Strepsiptera using PacBio HiFi reads reveals a miniature genome

Castaño, María Isabel; Ye, Xinhai; Uy, Floria M. K.

doi:10.1038/s41597-024-03808-w

Data Descriptor
Open access
Published: 28 August 2024

Volume 11, article number 934, (2024)

Scientific Data

Abstract

Twisted-wing insects (Strepsiptera) are an enigmatic order of parasites with unusual life histories and striking sexual dimorphism. Males emerge from hosts as free-living winged adults, while females from most species remain as endoparasites that retain larval traits. Due to scarce genomic data and phylogenetic controversies, Strepsiptera was only recently placed as the closest living relative to beetles (Coleoptera). Here, we report the first PacBio HiFi genome assembly of the strepsipteran Xenos peckii (Xenidae). This de novo assembly size is 72.1 Mb, with a BUSCO score of 87.4%, N50 of 7.3 Mb, 23.4% GC content, and 38.41% repeat content. We identified 8 contigs that contain >75% of the assembly and reflect the haploid chromosome number reported from karyotypic data, and 3 contigs that exhibit sex chromosome coverage patterns. Additionally, the mitochondrial genome is 16,111 bp long and has 37 genes. This long-read assembly for Strepsiptera reveals a miniature genome and provides a unique tool to understand complex genome evolution associated with a parasitic lifestyle and extreme sexual dimorphism.

Background & Summary

The order Strepsiptera, commonly known as twisted-wing insects, is a group of enigmatic obligate parasites with 10 extant families and approximately 600 described species¹. High diversification rates associated with adaptive specialization to their hosts have allowed them to parasitize 7 orders and 34 families of insects^2,3. More than 97% of Strepsiptera species are characterized by extreme sexual dimorphism, except for the ancestral family Mengenillidae, in which winged males and apterous females emerge as free-living adults^3,4. In the remaining families, males develop inside their host and emerge as free-living winged adults, while adult neotenic females remain as obligate endoparasites inside their host^5,6. The grub-like females lack a distinctive head, thorax, abdomen or body appendages and are viviparous, with some of the highest fecundity rates reported for any insect⁷.

A parasitic lifestyle coupled with unusual life history traits has been linked to unique genetic characteristics: exceptionally small genomes reported by flow cytometry, different levels of endoreduplication between males and females, and high rates of molecular evolution^7,8. These unusual genetic and morphological features, combined with the scarce genomic data available for the order, hindered efforts to place it in the insect phylogeny for decades, a difficulty known as the Strepsiptera problem⁹. However, recent studies used whole genome shotgun sequencing of the basal species Mengenilla moldrzyki (Mengenillidae) to resolve the enigma and confidently placed Strepsiptera as the closest living relative to beetles (Coleoptera)^10,11. Similarly, other studies now provide mitogenomic and transcriptomic data, in addition to phylogenomic comparisons based on shotgun whole genome sequencing for these unique parasites^{12,13,14,15,16,17}.

Parasites often use metabolic resources from their hosts¹⁸. Therefore, selection for metabolic efficiency, such as reducing the costs of replicating DNA, can result in smaller genomes^19,20. Similarly, organisms that undergo massive changes of their body plan when transitioning to an obligate endoparasitic life often experience extensive gene losses as a tradeoff^21,22. In Strepsiptera, genome size reduction or potential gene losses associated with host specialization and extreme sexual dimorphism offer compelling evolutionary scenarios for understanding genome evolution⁷. In fact, the only cytogenetic data available for Strepsiptera indicate that they have XY sex chromosomes and substantial variation in chromosome numbers within the same family (Xenidae). Specifically, these karyotypic studies report 2n = 16 for Xenos peckii^23,24 and 2n = 8 for Brasixenos sp. from Brazil²⁵. Xenidae is one of the most derived families in Strepsiptera and exhibits a high degree of host specialization. The ability to colonize new host lineages resulted in a constant net diversification rate that has produced between 77 and 96 potentially cryptic species²⁶.

Throughout the world, many species of Xenos consistently infect social wasps of the genus Polistes (Fig. 1)^27,28. Studies in two parallel Xenos-Polistes systems in Europe and North America reveal consistent behavioral and physiological manipulation strategies^5,29,30. Parasitic effects include hindering host reproduction and manipulating wasps to abandon their colony to form aggregations that facilitate mating of adult parasites. Thus, members of Xenos evolve in a continuous arms race with their hosts to overcome the host immune system³¹, which also may influence genome evolution. As the genomic revolution advances, sequencing of high-quality Strepsiptera genomes is key to understanding the molecular mechanisms underpinning host manipulation, as well as the forces driving genome size evolution in this enigmatic group of insects.

In this study, we present the first de novo genome assembly of the order Strepsiptera, from a male Xenos peckii (Strepsiptera: Xenidae), using PacBio HiFi sequencing technology. After filtering out bacterial contigs, the haploid assembly size is 72.1 Mb, assembled in 227 contigs, with a contig N50 of 7.3 Mb, a GC content of 23.4%, a BUSCO score of 87.4%, and a repetitive element content of 38.41%. This is the most complete and contiguous genome assembly available for the order Strepsiptera to date. More than 75% of the genome is contained in 8 major contigs (L75 = 8) and the longest contig is 8.76 Mb. The 8 major contigs reflect the number of chromosomes reported from karyotypic data (n = 8)^23,25. Furthermore, we identified 3 contigs that exhibit sex chromosome type coverage patterns (half depth of coverage in males). Additionally, we assembled and annotated the mitochondrial genome, which is 16,111 bp long and has 37 genes. The size and gene content of the mitochondrial genome are similar to those of published mitogenomes for X. vesparum found in Europe and its close relative, X. cf. moutoni from China^13,14, recently named X. yangi¹⁵. To our knowledge, this twisted-wing parasite genome is among the 10 smallest insect genomes reported in the literature, and the smallest insect genome assembled using long PacBio HiFi reads³². Our high-quality genomic data provide a unique tool to improve our understanding of the complex genome evolution associated with extreme sexual dimorphism coupled with a parasitic lifestyle. Additionally, this genome will facilitate the exploration of the molecular mechanisms underlying host and twisted-wing parasite coevolution.

Methods

Specimen collection and sample preparation

During June of 2021, we used entomological nets to collect infected social wasps, Polistes fuscatus, at the Robert H. Treman State Park located in Ithaca, NY, USA. We used taxonomic studies with detailed morphological characteristics to confirm that the parasite was X. peckii^23,33,34. Several studies also included information about voucher specimens, confirming that P. fuscatus is infected by X. peckii in Ithaca, NY^23,33,34. Infection and stage of development of X. peckii were confirmed by the extrusions in the host abdomen, which are morphologically distinct for developing female (N = 10) and male parasites (N = 10)⁵. We transported the infected wasps to the University of Rochester in disposable, transparent deli cups that are pre-punched to increase air flow. Each infected wasp was housed individually in a deli cup with a water vial and a sugar cube and kept under controlled conditions: a temperature of 25 °C, 40% humidity, and full spectrum light for 12 hours. We inspected infected wasps daily during the extrusion period of 3–7 days and 9–17 days for male and female parasites, respectively. After emerging from a host, male parasites were frozen with dry ice and stored at −80 °C. A single female was dissected out of its host and preserved the same way. A single male X. peckii was preserved for an ultra-low DNA input sample preparation and subsequent long-read sequencing, while an additional male and the female were preserved for short-read sequencing. Finally, we froze one more male in liquid nitrogen for subsequent RNA extraction, sequencing, and genome annotation.

DNA and RNA extraction, library preparation, sequencing and pre-assembly read processing

Library preparation and sequencing for PacBio HiFi data were performed at the DNA Sequencing and Genotyping Center of the Delaware Biotechnology Institute at the University of Delaware. Total genomic DNA was isolated using the Qiagen MagAttract kit (QIAGEN, Shanghai, China) following the manufacturer’s instructions. One SMRTbell library for circular consensus sequencing (CCS) was constructed according to the Low DNA input PacBio protocol with additional bead cleaning. The library was then sequenced on one SMRT cell of a PacBio Sequel II platform in CCS mode, which calls consensus reads from subreads that are generated after multiple passes of the enzyme around the circularized template (Pacific Biosciences). Sequencing yielded 8.6 Gb of data and ~0.9 million HiFi (high-fidelity) reads from the male X. peckii sample (Table 1).

Table 1 General statistics of raw sequencing reads used for X. peckii genome assembly and genome size estimation.

We extracted genomic DNA from the additional male and the female X. peckii using a DNeasy Blood and Tissue kit (QIAGEN, Valencia, CA, USA) for library preparation and short-read whole genome Illumina sequencing. RNA from a whole male was extracted with the RNeasy Mini Kit, implementing a DNase digestion step with the RNase-Free DNase Set (QIAGEN, Valencia, CA, USA). DNA and RNA samples were received and validated via standard quality control procedures at Novogene Co., Sacramento, CA. To construct the genomic DNA library, DNA was sheared into short fragments, which were end repaired, A-tailed and ligated with Illumina adapters. Fragments with adapters were PCR amplified, size selected and purified. Each library was checked with Qubit and real-time PCR for quantification, and then pooled for sequencing on a NovaSeq PE 150 Illumina platform (Novogene Co. Ltd. Sacramento, CA). Sequencing generated 7.5 Gb and 6.0 Gb of data for the male and female samples, respectively (Table 1).

After sequencing, we assessed read quality using FastQC³⁵. Then, we filtered reads for quality and length, and trimmed adapter sequences and polyG tails using fastp³⁶ with the option -l 40. We then re-assessed the quality of the processed reads, which we used to validate the genome size, heterozygosity and repetitive content estimations and to identify sex chromosomes based on coverage patterns. The transcriptome data (total RNA-seq) were also sequenced on the NovaSeq PE 150 platform, using an mRNA library preparation with standard poly A enrichment, which generated 6 Gb of raw data for use in the genome annotation.

Genome assembly and size estimation

Prior to assembly, estimating features like genome size, rate of heterozygosity, and repetitive element content proves useful to inform the parameter values that should be used in subsequent steps. We used Jellyfish (v2.3.0)³⁷ to count k-mers in the raw PacBio HiFi reads and compute a histogram of k-mer frequencies using the count (-C -m 21) and histo (-h 1000000) modules (Table 2).
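To make the k-mer histogram concrete, the following is a minimal Python sketch of canonical k-mer counting on in-memory reads. It is an illustration of the idea behind the count and histo steps, not the Jellyfish implementation; the function names `canonical` and `kmer_histogram` and the toy reads are hypothetical.

```python
from collections import Counter

# Translation table used to build reverse complements.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def kmer_histogram(reads, k=21):
    """Map each multiplicity to the number of distinct canonical k-mers
    observed that many times across all reads."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Skip windows containing N or any other non-ACGT symbol.
            if set(kmer) <= set("ACGT"):
                counts[canonical(kmer)] += 1
    return Counter(counts.values())


# Toy usage with a small k so the result is easy to inspect by hand.
toy_hist = kmer_histogram(["ACGT"], k=2)
```

Counting canonical k-mers means a k-mer and its reverse complement are tallied together, which is appropriate when the strand of each read is unknown.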

Table 2 Software and version used for analyses in this study with corresponding parameters, if different from the default options.

Then we used the Jellyfish histogram output to run the online web tool of GenomeScope2 (http://qb.cshl.edu/genomescope/genomescope2.0) with the following parameters: k-mer length = 21, ploidy = 2 and max k-mer coverage = 1,000,000³⁸. The model suggests that X. peckii has an estimated genome size of 63,567,278 bp with 1.19% heterozygosity, and that 36.4% of the genome is composed of repetitive elements (Fig. 2). The X. peckii genome we report here is one of the smallest insect genomes documented to date⁷ (Table 3; see technical validation below).
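As a rough intuition for how a k-mer histogram relates to genome size, the sketch below divides the total number of non-error k-mers by the depth of the main coverage peak. This is a deliberately crude heuristic and not the mixture model that GenomeScope2 fits; the function name, the `min_depth` error cutoff, and the toy histogram are assumptions made for illustration. In a heterozygous sample the tallest peak may be the heterozygous one, in which case this estimate would be inflated.

```python
def estimate_genome_size(hist, min_depth=5):
    """Crude genome size estimate from a k-mer histogram.

    hist maps multiplicity (depth) to the number of distinct k-mers seen
    at that depth. K-mers below min_depth are treated as sequencing errors.
    Returns (peak_depth, estimated_size_in_kmers).
    """
    usable = {depth: n for depth, n in hist.items() if depth >= min_depth}
    if not usable:
        raise ValueError("no k-mers at or above min_depth")
    # Depth with the most distinct k-mers, taken as the main coverage peak.
    peak_depth = max(usable, key=usable.get)
    # Total k-mer occurrences divided by the peak depth.
    total_kmers = sum(depth * n for depth, n in usable.items())
    return peak_depth, total_kmers // peak_depth


# Toy histogram: an error tail at low depth and a coverage peak at 50x.
toy = {1: 1000, 2: 100, 49: 50, 50: 200, 51: 50}
peak, size = estimate_genome_size(toy)
```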

Table 3 GenomeScope estimation of genome structure using reads from different sequencing technologies.

After a preliminary survey of the genome structure, we ran four different assemblers optimized for HiFi reads (i.e., Hifiasm³⁹, Improved Phased Assembler IPA⁴⁰, HiCanu⁴¹, Flye⁴²) and computed BUSCO scores to gauge differences in assembly metrics. Hifiasm produced the highest quality genome (see technical validation below; Table 4; Fig. 3).

Table 4 Assembly statistics using four different assembly software.

We assembled the HiFi reads without additional data into a draft haplotype-resolved genome assembly using Hifiasm v0.16.1-R341³⁹. We then used the default parameters of Hifiasm to generate a primary and alternate assembly graph after aggressive purging of haplotypic duplications (--primary, -l 3) and three rounds of error correction³⁹. The primary assembly produced by Hifiasm was 73,541,611 bp long, assembled in 229 contigs with an N50 of 7.3 Mb and a GC content of 23.6%. Both the primary (73 Mb) and alternate (64 Mb) assemblies produced by Hifiasm were similar in size to the genome size estimated by GenomeScope2.

However, the primary assembly was slightly larger than the GenomeScope2 estimations (Table 3; Fig. 2). High heterozygosity can result in spurious duplications that increase the assembly size because two alleles from the same locus are included in the primary assembly. We used Jupiter Plot (−n = 50000, ng = 75)⁴³ to produce a synteny plot and compare completeness between the primary and alternate assemblies produced by Hifiasm. The primary assembly produced by Hifiasm is more contiguous, and the alternate assembly does not include some of the largest contigs in the primary assembly (potential sex chromosomes based on coverage patterns; Supplementary Figure 1a). Thus, the small discrepancies (<10 Mb) between the GenomeScope2 genome size estimations and the final assembly size may be due to assembler limitations when sorting or deduplicating repetitive regions.

Genome quality assessment

We used Blobtools2 from the BlobToolKit suite⁴⁴ to screen for contamination. First, we performed a BLASTn⁴⁵ search of our assembly against the general RefSeq blast -nt database using the following parameters: -outfmt ‘6 qseqid staxids bitscore std’ -max_target_seqs 1 -num_threads 12 -evalue 1e-6. Then we used the function blobtools add to create a BlobDir database that included the BLAST output (hits file), the read coverage (BAM file) from mapping the raw reads back to the assembly, and the BUSCO scores (BUSCO summary file; see below). We used the BlobToolKit v.1.1.1⁴⁴ online Viewer (blobtools host ‘pwd’) to create a Blobplot with the hits of our bacterial contamination scan and their respective coverage (Supplementary Figure 2). We removed two contigs that matched the phylum Proteobacteria. Specifically, one whole contig was the genome of Wolbachia pipientis (see technical validation below; Supplementary Figure 2). Then, we used QUAST v3⁴⁶ to assess the metrics of the curated assembly. The final assembly size was 72,105,243 bp, assembled in 227 contigs with a GC content of 23.4%. We used BlobToolKit v.1.1.1⁴⁴ to create a SnailPlot to visualize our assembly statistics (Table 5; Fig. 4). This Whole Genome PacBio HiFi assembly project has been deposited at DDBJ/ENA/GenBank under the accession JAWUEG000000000⁶³.
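For readers unfamiliar with the contiguity metrics reported here, the following Python sketch shows how N50, L75 and GC content can be derived from contig lengths and sequences. It illustrates the standard definitions only and is not QUAST's code; the function names and toy values are hypothetical.

```python
def nx_lx(lengths, fraction):
    """Return (Nx, Lx) for a list of contig lengths.

    Nx is the length of the contig at which the cumulative length of
    contigs, sorted from longest to shortest, first reaches `fraction`
    of the assembly; Lx is the number of contigs needed to get there.
    """
    ordered = sorted(lengths, reverse=True)
    target = fraction * sum(ordered)
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= target:
            return length, count
    raise ValueError("empty assembly")


def gc_fraction(seqs):
    """Fraction of G and C among unambiguous (A, C, G, T) bases."""
    gc = 0
    acgt = 0
    for seq in seqs:
        seq = seq.upper()
        gc += seq.count("G") + seq.count("C")
        acgt += sum(seq.count(base) for base in "ACGT")
    return gc / acgt


# Toy assembly of five contigs (lengths in arbitrary units).
toy_lengths = [8, 7, 5, 3, 2]
n50, l50 = nx_lx(toy_lengths, 0.50)
n75, l75 = nx_lx(toy_lengths, 0.75)
```

Under these definitions, L75 = 8 means that the eight longest contigs together hold at least 75% of the assembled bases.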

Table 5 Statistics produced by QUAST for final assembly using Hifiasm after removing bacterial contamination.

We evaluated assembly quality and completeness using Benchmarking Universal Single-copy Orthologs (BUSCO) v.5.2.2⁴⁷ with the insecta_odb10 database. From a total of 1367 BUSCO groups searched, our assembly has 87.4% of genes complete and present in single copy, 0.7% duplicated genes, 1.9% fragmented genes, and 10.4% of the genes missing (Fig. 3). Additionally, we evaluated our assembly against the 2124 BUSCO groups from the endopterygota_odb10 database. This assessment shows 74.8% of complete and single copy genes, 1.1% duplicated genes, 6.5% fragmented genes, and 17.6% missing genes. The underlying cause behind the high proportion of missing genes from both datasets will be clarified once more genomic data for the order becomes available.

Mitochondrial genome assembly

We assembled the mitochondrial (mtDNA) genome of X. peckii from the raw PacBio HiFi reads with the MitoHiFi pipeline^48,49, which uses a reference-guided method to perform the assembly. First, the pipeline uses the software MitoFinder⁴⁹ to scan all the mtDNA assemblies available on NCBI and download the highest quality reference mitogenome for the most closely related taxon. PacBio HiFi reads were mapped to the reference mitogenome using Minimap2 v.2.17⁵⁰. Then, all the mapped reads were assembled de novo with Hifiasm³⁹ into a mitogenome. A mitochondrial genome is available for X. vesparum (partial genome: 14,519 bp NCBI Accession number DQ364229). In addition, two independent mitogenomes were found for X. yangi, previously X. cf. moutoni (partial genome: 16,717 bp NCBI Accession number MW222190 and partial genome: 15,324 bp NCBI Accession number OK329871)^13,14,15. Following the MitoFinder output, we used the mitochondrial genome provided under X. yangi as the starting reference sequence, because it is the most complete sequence available¹⁵. Then, we used MitoHiFi (-p 90 -o 5) to assemble the mitogenome. A total of 30,623 HiFi reads mapped to the reference and were used for de novo assembly with Hifiasm. The final mitogenome size assembled for X. peckii was 16,111 bp (NCBI Accession JAWUEG000000000)⁶³. To confirm the accuracy of the mitogenome, we used NOVOPlasty⁵¹ to assemble an independent mitogenome de novo. As input for NOVOPlasty, we used the Illumina 150 bp short reads from the male X. peckii (Table 1) and, as the starting reference, the COX1 X. peckii gene sequence available on NCBI (Accession JN082808)⁸. This de novo NOVOPlasty assembly resulted in a mitogenome of 15,946 bp, similar in size and syntenic to the one assembled with MitoHiFi (Supplementary Figure 3).

Subsequently, we used the software MITOS2⁵² to annotate the mitochondrial genome. The sequence spans the full region between rrnL and ND2, contains 37 genes and is circularized (Fig. 5a). After manual inspection and curation of the mitogenome, we found that MITOS2 was unable to annotate the complete region of the 12S rRNA gene (i.e., rrnS). In Strepsiptera, a recent study highlights that the rrnS gene appears to have highly divergent regions in the 5’ section of the gene that is flanked by trnV, which hinders the annotation process¹⁷. Additionally, to assess the coverage along the sequence, we mapped back to the assembly all the reads that aligned to the original reference X. yangi mitogenome, as well as all the HiFi reads available. We found relatively homogeneous coverage along the sequence, except in the first 2 kb, which is likely composed of repetitive elements (Fig. 5b,c). Finally, we used BLAST+ v2.1⁴⁵ to search for matches of our mitochondrial assembly in the nuclear genome and filtered out contigs from the nuclear genome with a percentage of sequence identity >99% and a size smaller than the mtDNA sequence⁵³.
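The final filtering step can be sketched in a few lines of Python. This is a minimal illustration of the stated criteria (identity above 99% and contig shorter than the mitogenome), assuming the BLAST hits have already been reduced to (contig, percent identity) pairs; the function name, input layout and contig names are hypothetical and not taken from the study.

```python
def mito_derived_contigs(hits, contig_lengths, mito_length, min_identity=99.0):
    """Return the sorted names of contigs that look like mitochondrial copies.

    hits           -- iterable of (contig_id, percent_identity) pairs
    contig_lengths -- dict mapping contig_id to contig length in bp
    mito_length    -- length of the assembled mitogenome in bp
    A contig is flagged when any hit exceeds min_identity and the contig
    is shorter than the mitogenome.
    """
    flagged = set()
    for contig, percent_identity in hits:
        if percent_identity > min_identity and contig_lengths[contig] < mito_length:
            flagged.add(contig)
    return sorted(flagged)


# Toy input with hypothetical contig names.
toy_hits = [("ctgA", 99.8), ("ctgB", 99.9), ("ctgC", 95.0)]
toy_lengths = {"ctgA": 12000, "ctgB": 5000000, "ctgC": 9000}
to_remove = mito_derived_contigs(toy_hits, toy_lengths, mito_length=16111)
```

Requiring both conditions keeps large nuclear contigs that merely contain a short mitochondrial insertion.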

Mapping rate and coverage for X-linked contigs discovery

Cytogenetic data for the order Strepsiptera is only available for two species, and it indicates that they have heteromorphic X and Y chromosomes^12,54. Specifically, the number of chromosomes of X. peckii was identified as 2n = 16^23,25. To evaluate mapping rate and depth of coverage along the assembly we used Minimap2 v.2.17⁵⁰ to map the raw long PacBio HiFi reads, as well as the short Illumina (150 bp) PE reads, back to the final assembly. The male X. peckii HiFi reads had an average mapping rate of 99.43% and 97.8% coverage, with a mean depth of coverage of 117X. For the Illumina PE reads the mapping rate was 94.95% for the male and 77.61% for the female, with a mean depth of coverage of 97.27X and 63.73X respectively. We used the CIRCOS⁵⁵ tool to visualize the depth of coverage through the first 11 contigs that contain >90% of the genome. We identified three X-linked contigs (ctg02, ctg23 and ctg15), which presented a male to female coverage ratio of 0.5. Interestingly, ctg02 is also the largest contig in our assembly (Fig. 6).
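The coverage logic behind this kind of screen can be illustrated with a short Python sketch. It assumes per-contig mean depths are already available for one male and one female sample, normalizes each sample by its median contig depth so that different sequencing efforts are comparable, and flags contigs whose normalized male-to-female ratio is near 0.5. The function name, the 0.4 to 0.6 window and the toy depths are assumptions for illustration, not the procedure or thresholds used in the study.

```python
from statistics import median


def x_linked_candidates(male_depth, female_depth, low=0.4, high=0.6):
    """Return {contig: ratio} for contigs with a normalized male:female
    depth ratio between low and high (inclusive).

    male_depth, female_depth -- dicts mapping contig name to mean depth.
    Each sample is normalized by its median contig depth, which assumes
    that most contigs are autosomal.
    """
    shared = [c for c in male_depth
              if c in female_depth and female_depth[c] > 0]
    male_norm = median(male_depth[c] for c in shared)
    female_norm = median(female_depth[c] for c in shared)
    candidates = {}
    for contig in shared:
        ratio = (male_depth[contig] / male_norm) / (female_depth[contig] / female_norm)
        if low <= ratio <= high:
            candidates[contig] = ratio
    return candidates


# Toy depths with hypothetical contig names; "c" has half depth in the male.
toy_male = {"a": 100, "b": 100, "c": 50, "d": 100, "e": 98}
toy_female = {"a": 60, "b": 62, "c": 60, "d": 58, "e": 60}
flagged = x_linked_candidates(toy_male, toy_female)
```

In an XY system a male carries one X copy and a female two, so X-linked sequence is expected at roughly half the normalized depth in the male.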

Repeat modeler

As this is the first high-quality genome assembly of the order Strepsiptera and there are no repeat libraries available for the order, we identified and classified repetitive elements de novo using RepeatModeler v.2.0.1⁵⁶. Then, we used RepeatMasker v.4.0.7⁵⁷ to mask the genome assembly using the de novo library produced by RepeatModeler2. RepeatMasker masked 38.41% of the genome, a value similar to the repeat content suggested by GenomeScope2 before removing the bacterial contigs. Most of the repeats in the genome are unclassified (27.8%). LTR elements are the most abundant elements among the classified repeats (5.74%; Fig. 7; Table 6).
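As a simple cross-check of a reported masked percentage, the masked fraction can be recomputed directly from masked sequences. The Python sketch below assumes soft masking, in which repeats are written in lowercase, and is not part of RepeatMasker; the function name and toy sequences are hypothetical.

```python
def masked_fraction(seqs):
    """Fraction of bases written in lowercase across all sequences,
    assuming a soft-masked assembly in which lowercase marks repeats."""
    masked = sum(sum(1 for base in seq if base.islower()) for seq in seqs)
    total = sum(len(seq) for seq in seqs)
    return masked / total


# Toy soft-masked contigs: 4 of 10 bases are lowercase.
toy_fraction = masked_fraction(["ACGTac", "gtAC"])
```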

Table 6 Repetitive elements modeled de novo with RepeatModeler2 (class and content).

Annotation

We annotated the X. peckii genome assembly using the BRAKER3⁵⁸ pipeline. First, we used RepeatMasker v.4.0.7⁵⁷ to soft mask the genome (-xsmall) using the repeat library we generated de novo with RepeatModeler2⁵⁶. Next, we used the software STAR (Spliced Transcripts Alignment to a Reference) v2.7.1⁵⁹ to map RNA-seq reads (male; whole body) to the genome. Then we used the mapped reads (--bam) as evidence input for BRAKER3⁵⁸ to generate ab initio gene predictions, which resulted in 8759 predicted genes. Additionally, we ran a homology-based gene prediction with the tool GeMoMa v1.9⁶⁰, which allows the incorporation of RNA-seq evidence for splice site prediction. We used the annotated proteins from the published genomes of Drosophila melanogaster⁶¹ and the beetle Tribolium castaneum⁶² as reference input, and incorporated the previously mapped RNA-seq reads with the following command:

java -Xms5G -Xmx170G -jar GeMoMa-1.9.jar CLI GeMoMaPipeline AnnotationFinalizer.r=NO o=true t=UROC_Xpeckii_1.1.fasta s=own i=Tcast a=./GCF_031307605.1/genomic.gff g=GCF_031307605.1_icTriCast1.1_genomic.fna s=own i=Dmel a=dmel-all-r6.45.problem.free.subset.gff g=dmel-all-chromosome-r6.56.fasta r=MAPPED ERE.s=FR_FIRST_STRAND ERE.m=STAR_RNA_mapped.bam threads=16 outdir=TriboliumDrosophila

The combination of homology-based and ab initio gene predictions by GeMoMa resulted in 12,860 genes (GFF file available in Figshare⁶⁴).

Data Records

The de novo whole genome PacBio HiFi assembly and the mitochondrial genome assembly were deposited at DDBJ/ENA/GenBank under the accession JAWUEG000000000⁶³. The mitochondrial genome annotation is available as a feature table and in GenBank format in Figshare⁶⁴. The raw genomic and transcriptomic Illumina PE sequencing reads were deposited at the NCBI Sequence Read Archive under the SRP accession number SRP488882⁶⁵. Sample-specific accession numbers are SRX23571581 (male WGS), SRX23571582 (female WGS) and SRX23571583 (male RNA-seq). The GFF files with ab initio gene predictions from BRAKER3 and the homology-based gene predictions from GeMoMa are available in Figshare⁶⁴.

Technical Validation

GenomeScope2 size estimation validation and comparative analysis with other Xenos species

We used the short 150 bp Illumina PE reads from both male and female X. peckii samples to verify the accuracy of our GenomeScope2 estimations from the raw PacBio HiFi reads. We computed the histogram of k-mer frequencies with Jellyfish v2.3.0³⁷ (-C -m 21; histo -h 1000000), and ran the output through the online web tool of GenomeScope2 (k-mer length = 21, ploidy = 2 and max k-mer coverage = 1,000,000)³⁸. The estimates from the high-coverage short Illumina reads suggest a genome size of 67,923,714 bp for the male and 75,093,981 bp for the female (Table 3; Supplementary Figure 4). Moreover, the estimated repetitive content of the genome is around 43.7% and 41.4% for the male and female reads, respectively.

To further validate our results and place them in a comparative framework, we performed the same GenomeScope2 estimations with the only two Xenos species for which whole genome shotgun sequence data are available: X. vesparum and X. cf. moutoni (150 bp Illumina PE reads; NCBI Accession numbers SAMN03323551 and PRJNA681068, respectively)^12,14. The genome size of X. vesparum was previously calculated using flow cytometry data and is reported to be ~133 Mb⁷. Our GenomeScope2 estimations are consistent with this previously reported size for X. vesparum, with a slightly smaller genome size of ~120.8 Mb. Similar to our X. peckii assembly, the X. vesparum data showed small discrepancies (~10 Mb) between the GenomeScope2 estimations and the genome size calculated using flow cytometry. These results indicate that our estimations closely reflect the genome size of the 3 X. peckii samples that we analyzed. Our results reveal a significant amount of variation in genome size for Xenos, ranging from 58,068,699 bp in X. cf. moutoni to 120,869,754 bp in X. vesparum (Table 3). Importantly, the genome size estimations for our male X. peckii reads are ~10 Mb smaller than those for the female reads, which is consistent with differentiated XY chromosomes. If the X chromosome has slightly higher repetitive content, then we would expect the females to have a larger genome size. However, the repetitive content estimated for both male and female short reads was similar (Table 3). Further studies are needed to confirm whether the difference in genome size estimations is due to sequencing or assembly artifacts or to real biological differences between the genomes of males and females.

Genome assembly metrics with different software

We ran four different assemblers (i.e., Hifiasm³⁹, Improved Phased Assembler IPA⁴⁰, HiCanu⁴¹, Flye⁴²) to assess which one produced the highest quality genome. We evaluated the quality of all assemblies using QUAST v.5.0.2⁴⁶ and BUSCO v.5.2.2⁴⁷ with the insecta_odb10 database (Table 4; Fig. 3). Hifiasm produced the overall most contiguous and complete genome assembly, with the lowest rates of duplications reported by BUSCO. Both Flye and HiCanu were inefficient in resolving the assembly, which resulted in large and fragmented assemblies with high duplication according to the BUSCO scores (Fig. 3). Moreover, we used a Jupiter plot⁴³ (−n = 50000, ng = 75) to compare the contiguity and completeness of the assemblies produced by different software. We confirmed that the assembly produced by Hifiasm is more contiguous and complete than the one produced by IPA, which was the second-best ranked assembly in our analysis (Supplementary Figure 1b).

Code availability

No specific code or script was used in this study. Data processing was executed following the documentation of the corresponding software described in the methods. Software programs with no parameters associated were used with the default settings.

References

Cook, J. L. Annotated catalog of the order Strepsiptera of the world. Trans. Am. Entomol. Soc. 145, 121–267 (2019).
Benda, D. Evolution of host specialisation, phylogeography and taxonomic revision of Xenidae (Strepsiptera). (Doctoral dissertation, Charles University, Faculty of Science, Prague, 2023).
Kathirithamby, J. Host-parasitoid associations in Strepsiptera. Annu. Rev. Entomol. 54, 227–249 (2009).
Boussau, B. et al. Strepsiptera, phylogenomics and the long branch attraction problem. PLoS ONE 9, e107709 (2014).
Hrabar, M., Danci, A., McCann, S., Schaefer, P. W. & Gries, G. New findings on life history traits of Xenos peckii (Strepsiptera: Xenidae). Can. Entomol. 146, 514–527 (2014).
Richter, A., Wipfler, B., Beutel, R. & Pohl, H. The female cephalothorax of Xenos vesparum Rossi, 1793 (Strepsiptera: Xenidae). Arthropod Syst. Phylogeny 75, 327–347 (2017).
Johnston, J. S., Ross, L. D., Beani, L., Hughes, D. P. & Kathirithamby, J. Tiny genomes and endoreduplication in Strepsiptera. Insect Mol. Biol. 13, 581–585 (2004).
McMahon, D. P., Hayward, A. & Kathirithamby, J. The first molecular phylogeny of Strepsiptera (Insecta) reveals an early burst of molecular evolution correlated with the transition to endoparasitism. PLoS ONE 6, e21206 (2011).
Kristensen, N. P. Phylogeny of insect orders. Annu. Rev. Entomol. 26, 135–157 (1981).
Pohl, H. & Beutel, R. G. The Strepsiptera-odyssey: the history of the systematic placement of an enigmatic parasitic insect order. Entomologia 1, e4 (2013).
Niehuis, O. et al. Genomic and morphological evidence converge to resolve the enigma of Strepsiptera. Curr. Biol. 22, 1309–1313 (2012).
Mahajan, S. & Bachtrog, D. Partial dosage compensation in Strepsiptera, a sister group of Beetles. Genome Biol. Evol. 7, 591–600 (2015).
Carapelli, A. et al. The mitochondrial genome of the entomophagous endoparasite Xenos vesparum (Insecta: Strepsiptera). Gene 376, 248–259 (2006).
Zhang, R. et al. The mitochondrial genome of one ‘twisted-wing parasite’ Xenos cf. moutoni (Insecta, Strepsiptera, Xenidae) from Gaoligong Mountains, Southwest of China. Mitochondrial DNA B Resour 6, 512–514 (2021).
Dong, Z., Liu, X., Mao, C., He, J. & Li, X. Xenos yangi sp. nov.: A new twisted-wing parasite species (Strepsiptera, Xenidae) from Gaoligong Mountains, Southwest China. Zookeys 1085, 11–27 (2022).
Lähteenaro, M. et al. Phylogenomic species delimitation of the twisted-winged parasite genus Stylops (Strepsiptera). Syst. Entomol. 49, 294–313 (2024).
Towett-Kirui, S., Morrow, J. L. & Riegler, M. Substantial rearrangements, single nucleotide frameshift deletion and low diversity in mitogenome of Wolbachia-infected strepsipteran endoparasitoid in comparison to its tephritid hosts. Sci. Rep. 12, 477 (2022).
Hechinger, R., Lafferty, K. & Kuris, A. Parasites. in Metabolic Ecology: A Scaling Approach (eds. Sibly, R. M., Brown, J. & Kodric-Brown, A.) 234–247 (John Wiley and Sons, New Jersey, 2012).
Ogburn, N. T. Genome size and host specialization in parasites. (Master’s thesis, University of South Florida, Tampa, Florida, USA, 2019).
Jackson, A. P. The evolution of parasite genomes and the origins of parasitism. Parasitology 142, S1–S5 (2015).
Spanu, P. D. et al. Genome expansion and gene loss in powdery mildew fungi reveal tradeoffs in extreme parasitism. Science 330, 1543–1546 (2010).
Sun, G. et al. Large-scale gene losses underlie the genome evolution of parasitic plant Cuscuta australis. Nat. Commun. 9, 2683 (2018).
Rieder, C. L. & Nowogrodzki, R. Intranuclear membranes and the formation of the first meiotic spindle in Xenos peckii (Acroschismus wheeleri) oocytes. J. Cell Biol. 97, 1144–1155 (1983).
Schrader, S. H. Reproduction in Acroschismus wheeleri Pierce. J. Morphol. 39, 157–205 (1924).
Ferreira, A., Cella, D. M., Mesa, A. & Virkki, N. Cytology and systematical position of stylopids (Strepsiptera). Hereditas 100, 51–52 (1984).
Benda, D., Votýpková, K., Nakase, Y. & Straka, J. Unexpected cryptic species diversity of parasites of the family Xenidae (Strepsiptera) with a constant diversification rate over time. Syst. Entomol. 46, 252–265 (2021).
Benda, D., Nakase, Y. & Straka, J. Frozen Antarctic path for dispersal initiated parallel host-parasite evolution on different continents. Mol. Phylogenet. Evol. 135, 67–77 (2019).
Hughes, D. P., Kathirithamby, J. & Beani, L. Prevalence of the parasite Strepsiptera in adult Polistes wasps: field collections and literature overview. Ethol. Ecol. Evol. 16, 363–375 (2004).
Beani, L. et al. When a parasite breaks all the rules of a colony: morphology and fate of wasps infected by a strepsipteran endoparasite. Anim. Behav. 82, 1305–1312 (2011).
Gandia, K. M. et al. Caste, sex, and parasitism influence brain plasticity in a social wasp. Front. Ecol. Evol. 10, 781984 (2022).
Manfredini, F., Benati, D. & Beani, L. The strepsipteran endoparasite Xenos vesparum alters the immunocompetence of its host, the paper wasp Polistes dominulus. J. Insect Physiol. 56, 253–259 (2010).
Hotaling, S. et al. Long reads are revolutionizing 20 years of insect genome sequencing. Genome Biol. Evol. 13, evab138 (2021).
Garza, C. & Cook, J. L. The taxonomy of adult females in the genus Xenos (Strepsiptera: Xenidae) with a re-description of the females of three North American species. J. Kansas Entomol. Soc. 93, 298–312 (2021).
Benda, D., Pohl, H., Nakase, Y., Beutel, R. & Straka, J. A generic classification of Xenidae (Strepsiptera) based on the morphology of the female cephalothorax and male cephalotheca with a preliminary checklist of species. Zookeys 1093, 1–134 (2022).
Babraham Bioinformatics. FastQC: A quality control tool for high throughput sequence data https://www.bioinformatics.babraham.ac.uk/projects/fastqc/ (2022).
Chen, S., Zhou, Y., Chen, Y. & Gu, J. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics 34, i884–i890 (2018).
Marçais, G. & Kingsford, C. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics 27, 764–770 (2011).
Ranallo-Benavidez, T. R., Jaron, K. S. & Schatz, M. C. GenomeScope 2.0 and Smudgeplot for reference-free profiling of polyploid genomes. Nat. Commun. 11, 1432 (2020).
Cheng, H., Concepcion, G. T., Feng, X., Zhang, H. & Li, H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat. Methods 18, 170–175 (2021).
Pacific Biosciences. Improved Phased Assembler (IPA). GitHub https://github.com/PacificBiosciences/pbipa (2021).
Nurk, S. et al. HiCanu: accurate assembly of segmental duplications, satellites, and allelic variants from high-fidelity long reads. Genome Res. 30, 1291–1305 (2020).
Kolmogorov, M., Yuan, J., Lin, Y. & Pevzner, P. A. Assembly of long, error-prone reads using repeat graphs. Nat. Biotechnol. 37, 540–546 (2019).
Chu, J. Jupiter Plot: a Circos-based tool to visualize genome assembly consistency (1.0). Zenodo https://doi.org/10.5281/zenodo.1241235 (2018).
Challis, R., Richards, E., Rajan, J., Cochrane, G. & Blaxter, M. BlobToolKit – Interactive Quality Assessment of Genome Assemblies. G3: Genes Genom. Genet. 10, 1361–1374 (2020).
Camacho, C. et al. BLAST+: architecture and applications. BMC Bioinformatics 10, 421 (2009).
Mikheenko, A., Saveliev, V. & Gurevich, A. MetaQUAST: evaluation of metagenome assemblies. Bioinformatics 32, 1088–1090 (2016).
Manni, M., Berkeley, M. R., Seppey, M., Simão, F. A. & Zdobnov, E. M. BUSCO Update: Novel and Streamlined Workflows along with Broader and Deeper Phylogenetic Coverage for Scoring of Eukaryotic, Prokaryotic, and Viral Genomes. Molec. Biol. Evol. 38, 4647–4654 (2021).
Uliano-Silva, M. et al. MitoHiFi: a python pipeline for mitochondrial genome assembly from PacBio high fidelity reads. BMC Bioinformatics 24, 288 (2023).
Allio, R. et al. MitoFinder: Efficient automated large-scale extraction of mitogenomic data in target enrichment phylogenomics. Molec. Ecol. Res. 20, 892–905 (2020).
Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).
Dierckxsens, N., Mardulyn, P. & Smits, G. NOVOPlasty: de novo assembly of organelle genomes from whole genome data. Nucleic Acids Res. 45, e18 (2017).
Bernt, M. et al. MITOS: improved de novo metazoan mitochondrial genome annotation. Mol. Phylogenet. Evol. 69, 313–319 (2013).
DeRaad, D. A. et al. De novo assembly of a chromosome-level reference genome for the California scrub-jay, Aphelocoma californica. J. Hered. 114, 95–108, https://doi.org/10.1093/jhered/esad047 (2023).
Blackmon, H., Ross, L. & Bachtrog, D. Sex determination, sex chromosomes, and karyotype evolution in insects. J. Hered. 108, 78–93 (2017).
Krzywinski, M. et al. Circos: an information aesthetic for comparative genomics. Genome Res. 19, 1639–1645 (2009).
Flynn, J. M. et al. RepeatModeler2 for automated genomic discovery of transposable element families. Proc. Natl Acad. Sci. USA 117, 9451–9457 (2020).
Smit, A. F. A., Hubley, R. & Green, P. RepeatMasker Open-4.0. http://www.repeatmasker.org (2013–2015).
Gabriel, L. et al. BRAKER3: Fully automated genome annotation using RNA-seq and protein evidence with GeneMark-ETP, AUGUSTUS, and TSEBRA. Genome Res. 34, 769–777 (2024).
Dobin, A. et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29, 15–21 (2013).
Keilwagen, J., Hartung, F. & Grau, J. GeMoMa: homology-based gene prediction utilizing intron position conservation and RNA-seq data. Methods Mol. Biol. 1962, 161–177 (2019).
FlyBase. Drosophila melanogaster Annotation, Release 6.54. NCBI Datasets https://www.ncbi.nlm.nih.gov/datasets/gene/GCA_000001215.4/ (2024).
NCBI. Tribolium castaneum (Red Flour Beetle) icTriCast1.1 (GCF_031307605.1) annotation release GCF_031307605.1-RS_2024_04. NCBI Datasets https://www.ncbi.nlm.nih.gov/datasets/genome/annotation/GCF_031307605.1/ (2024).
Castaño, M. I., Ye, X. & Uy, F. M. K. Xenos peckii isolate 2022_WS26, whole genome shotgun sequencing project. GenBank https://identifiers.org/ncbi/insdc:JAWUEG000000000.1 (2024).
Castaño, M. I., Ye, X. & Uy, F. M. K. First genome assembly of the order Strepsiptera using PacBio HiFi reads reveals a miniature genome. Figshare https://doi.org/10.6084/m9.figshare.c.7085338.v1 (2024).
NCBI Sequence Read Archive. https://identifiers.org/ncbi/insdc.sra:SRP488882 (2024).

Acknowledgements

This work was supported by funding provided by the University of Rochester to M.I.C. and F.M.K.U. Special thanks to J. Albert C. Uy, Faye Romero, Christine Muirhead, Jack Werren and Felix Beaudry for invaluable advice. Emiliano Martí and Juan Martín Ferro provided crucial advice for analyses and code to run the genome annotation. Amanda Larracuente and members of the TropBioLab at the University of Rochester gave feedback that significantly improved this manuscript. Erin Bernberg, Olga Shevchenko, and Brewster Kingham at the University of Delaware sequencing facility provided thoughtful logistical support. We thank Sloan Tomlinson and Adam Fenster for insect photographs, and James Brophy at Robert H. Truman Park for field support. Samples were collected under permit OPHRP FL01 issued by the New York Office of State Parks.

Author information

Authors and Affiliations

University of Rochester, Department of Biology, Rochester, NY, USA
María Isabel Castaño & Floria M. K. Uy
College of Advanced Agriculture Science, Zhejiang A&F University, Hangzhou, Zhejiang, China
Xinhai Ye

Authors

María Isabel Castaño
Xinhai Ye
Floria M. K. Uy

Contributions

M.I.C. and F.M.K.U. designed the study. F.M.K.U. collected the samples and performed DNA and RNA extractions. M.I.C. performed the analyses. X.Y. provided valuable code and input to perform the genome assembly. M.I.C. and F.M.K.U. wrote the first version of the manuscript. All authors read, revised, and approved the final manuscript.

Corresponding authors

Correspondence to María Isabel Castaño or Floria M. K. Uy.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Material

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Castaño, M.I., Ye, X. & Uy, F.M.K. First genome assembly of the order Strepsiptera using PacBio HiFi reads reveals a miniature genome. Sci Data 11, 934 (2024). https://doi.org/10.1038/s41597-024-03808-w

Received: 11 March 2024
Accepted: 20 August 2024
Published: 28 August 2024
DOI: https://doi.org/10.1038/s41597-024-03808-w
Springer Nature Limited
