Chromosome-level genome assembly of the cottony cushion scale Icerya purchasi

Deng, Jun; Zhang, Lin; Zhang, Hui; Wang, Xubo; Huang, Xiaolei

doi:10.1038/s41597-024-03502-x

Chromosome-level genome assembly of the cottony cushion scale Icerya purchasi

Data Descriptor
Open access
Published: 17 June 2024

Volume 11, article number 639, (2024)
Cite this article

Download PDF

You have full access to this open access article

Scientific Data

Chromosome-level genome assembly of the cottony cushion scale Icerya purchasi

Download PDF

Jun Deng¹,
Lin Zhang¹,
Hui Zhang¹,
Xubo Wang^2,3 &
…
Xiaolei Huang ORCID: orcid.org/0000-0002-6839-9922¹

541 Accesses
1 Altmetric
Explore all metrics

Abstract

The cottony cushion scale, Icerya purchasi, a polyphagous pest, poses a significant threat to the global citrus industry. The hermaphroditic self-fertilization observed in I. purchasi is an exceptionally rare reproductive mode among insects. In this study, we successfully assembled a chromosome-level genome sequence for I. purchasi using PacBio long-reads and the Hi-C technique, resulting in a total size of 1,103.38 Mb and a contig N50 of 12.81 Mb. The genome comprises 14,046 predicted protein-coding genes, with 462,722,633 bp occurrence of repetitive sequences. BUSCO analysis revealed a completeness score of 93.20%. The genome sequence of I. purchasi serves as a crucial resource for comprehending the reproductive modes in insects, with particular emphasis on hermaphroditic self-fertilization.

A chromosome-level genome assembly of the forestry pest Coronaproctus castanopsis

Article Open access 17 February 2024

Chromosome level genome assembly of oriental armyworm Mythimna separata

Article Open access 08 September 2023

A chromosome-level genome assembly of Sesamia inferens

Article Open access 25 January 2024

Background & Summary

The cottony cushion scale, Icerya purchasi Maskell (Monophlebidae: Icerya), is a destructive pest well-known for infecting citrus and lemon plants, with a wide distribution across approximately 150 countries in America, Asia, Europe, and Oceania¹. Understanding the observed diversity in the reproduction modes across different taxa has become a key focus in evolutionary biology. Scale insects (Hemiptera: Sternorrhyncha: Coccoidea), comprising 8535 species, exhibit significant variation in their reproduction modes, including paternal genome elimination, haplodiploidy, parthenogenesis, diplodiploidy, and hermaphroditism^2,3. I. purchasi displays a unique reproductive mode among insects, which is rare in arthropods^4,5. Hermaphrodites with a female-like appearance produce sperm and undergo self-fertilization⁶. Male individuals are scarce in the wild and occasionally mate with hermaphrodites, highlighting the colonization advantages of a selfing organism and the periodic reintroduction of genetic variation⁶. To date, there are seven chromosome-level genomes belonging to scale insects available from GenBank, two of which have been published as research papers^7,8. A high-quality reference genome of I. purchasi would be invaluable for comprehending the mechanism of hermaphroditic self-fertilization.

In this study, we generated 61.21 Gb of Illumina short-read sequencing, 36.4 Gb of PacBio long-read sequencing, and 120.71 Gb of High-throughput chromosome conformation capture (Hi-C) sequencing. The final genome assembly size was 1,103.38 Mb, with a contig N50 of 12.81 Mb and scaffold N50 of 601.66 Mb (Fig. 1). The total length of the genome assembly was comparable to the estimated genome size (1,097.12 Mb) based on k = 21 kmer analysis (Fig. 2a). Following Hi-C assembly and manual adjustment of the heat map, a total sequence length of 1,102.86 Mb was mapped onto two chromosomes, representing 99.95% of the genome. Within the sequences mapped to chromosomes, the length of sequences with determined orientation and direction was 1,098.95 Mb, constituting 99.64% of the total length of the mapped chromosome sequences.

Based on the analysis of tandem repeats and interspersed repeats, we identified 462,722,633 bp of transposable element (TE) sequences, comprising 41.94% of the total, along with 35,341,500 bp of tandem repeat sequences, accounting for 3.20% (Fig. 3). Gene prediction involved homology-based prediction, ab initio prediction, and transcript-based prediction, resulting in a total of 14,046 predicted genes (Fig. 4). Further assessment through BUSCO analysis based on the insecta (odb_10) database indicated the presence of 93.20% in our predicted genes, indicating a high integrity of gene prediction. Annotation analysis of the predicted gene sequences against databases such as NR, EggNOG, GO, KEGG, SWISS-PROT, and Pfam yielded comprehensive annotation results, with a total of 89.71% of genes annotated to the databases (Table 1).

Table 1 Summary of genome annotations for I. purchasi.

Full size table

Methods

Sample collection and genome sequencing

I. purchasi was collected in 2020 from a lemon tree in Ningde, Fujian Province, China (26°34′47″N, 118°44′15″E). The laboratory colony was cultured on lemon trees for four generations in the insectarium under controlled conditions of 27 ± 1 °C and 60% relative humidity, with a photoperiod of 14:10 (L:D). Adult females and nymphs were collected for sequencing.

To perform genome sequencing, the samples were washed with 75% ethanol and sterile water, then stored in cryogenic vials with the addition of 95% ethanol. Subsequently, the total DNA from the entire body was extracted using the CATB method⁹ to construct three different sequencing libraries, followed by sequencing according to standard procedures. Firstly, for genome survey, we used the NEBNext® Ultra™ II DNA Library Prep Kit to build an Illumina short-read library, which was sequenced on the Illumina NovaSeq 6000 platform. Secondly, we constructed PacBio library using the SMRTbell Template Prep Kit and performed HiFi sequencing on the PacBio Sequel II platform. Finally, to generate a chromosome-level genome, we used the Mate-pair Kit to build Hi-C library, which was sequenced on the Illumina NovaSeq 6000 platform.

For transcriptome sequencing, we employed the Trizol method¹⁰ to extract RNA from fresh samples. The NEBNext® Ultra™ RNA Library Prep Kit was used to construct RNA library, which was sequenced on the Illumina NovaSeq 6000 platform for transcriptome sequencing. Additionally, the Ligation Sequencing Kit (SQK-LSK110; ONT) was used to construct library, and full-length transcriptome sequencing was completed on the Oxford Nanopore PromethION platform.

Genome assembly

K-mer analysis was conducted using Jellyfish 2.1.4¹¹ based on short-read data. Genomescope 2.0¹² (-k 21 -p 2 -m 100,000) was used to estimate genome size and heterozygosity. The PacBio Sequel II platform generated a total of 36.4 Gb of long reads. These reads were assembled using Hifiasm v0.14¹³. For chromosomal scaffolding, Hi-C data were aligned to the contig-level genome sequence using Bwa v0.7.17¹⁴. Paired reads with mate mapped to a different contig were used to do the Hi-C associated scaffolding. Then, self-ligation, non-ligation and other invalid reads were filtered out. The uniquely mapped data were retained for assembly using LACHESIS¹⁵. Any two segments which showed inconsistent connection with information from the raw scaffold were checked manually. These corrected scaffolds were divided into subgroups and sorted and oriented into pseudomolecules with LACHESIS (CLUSTER_MIN_RE_SITES = 35; CLUSTER_MAX_LINK_DENSITY = 2; ORDER_MIN_N_RES_IN_TRUNK = 98; ORDER_MIN_N_RES_IN_SHREDS = 97). After this step, placement and orientation errors featuring obvious discrete chromatin interaction patterns were manually adjusted, and the Hi-C interaction heatmap was constructed using the ggplot2¹⁶ software in the R package. Finally, high quality assembly at chromosome level was obtained, and a snail plot of genomic features was generated by Blobotootlkit¹⁷.

TE annotation

Transposon element (TE) was identified by a combination of homology-based and de novo approaches. Initially, de novo prediction was conducted using RepeatModeler2 v2.0.1¹⁸, which invoked RECON v1.0.8¹⁹ and RepeatScout v1.0.6²⁰. The predicted results were categorized using RepeatClassifier¹⁸ with the assistance of the Dfam database v3.5²¹. Subsequently, LTR _retriever v2.9.0²² was utilized for de novo prediction specifically targeting Long Terminal Repeats (LTRs). Full-length long terminal repeat retrotransposons (fl-LTR-RTs) were identified using both LTRharvest v1.5.10²³ and LTR_FINDER v1.07²⁴ (ltr_finder -w 2 -C -D), resulting in the generation of high-quality intact fl-LTR-RTs and a non-redundant LTR library. The de novo prediction results were integrated with the known databases to construct a non-redundant species-specific TE library. Employing RepeatMasker v4.1.2²⁵ (repeatmasker -nolow -no_is -norna -engine wublast -parallel 8 -qq), homology searches were performed on the library to identify and classify TE sequences in the final chromosome genomes. Concatenated repeat sequences were annotated by MIcroSAtellite identification tool (MISA v2.1)²⁶ and Tandem Repeat Finder v4.09²⁷ (trf 2 7 7 80 10 50 500 -d -h).

Gene prediction and annotation

We integrated three approaches (ab initio prediction, homology search and transcript-based assembly) to annotate protein-coding genes in the genome. Ab initio prediction was conducted using Augustus v3.1.0²⁸ and SNAP (2006-07-28)²⁹. GeMoMa v1.7³⁰ was utilized for Homology-based prediction, comparing gene features with those of cotton mealybug (P. solenopsis), the fruit fly (Drosophila melanogaster), pea aphids (Acyrthosiphon pisum), and tobacco whitefly (Bemisia tabaci). For transcript-based prediction, RNA sequencing data were mapped to the reference genome using Hisat v2.0. 4³¹ (hisat2–dta -p 10), and assembled using Stringtie v1.2.3³² (stringtie -p 2). GeneMarkS-T v5.1³³ was employed for gene prediction based on assembled transcripts. The PASA v2.4.1³⁴ was used to predict genes based on the unigenes (and full-length transcripts from the ONT sequencing) assembled by Trinity v2.11³⁵ (Trinity–genome_guided_bam). Finally, predictions from these three methods were integrated using EVM v1.1.1³⁶ and modified with PASA. Gene prediction completeness was assessed using BUSCO v5.2.2³⁷ (busco -m prot). Gene functions were deduced through comprehensive sequence alignments against several databases including the NCBI Non-Redundant (NR), EggNOG, KOG, TrEMBL, InterPro, and Swiss-Prot, employing diamond blastp (diamond v0.9.29³⁸) and the Kyoto Encyclopedia of Genes and Genomes (KEGG) database with an E-value threshold of 1E-3. Protein domains were annotated via InterProScan v5.34³⁹ utilizing InterPro protein databases, while PFAM databases was used the identification of motifs and domains within gene models. Gene Ontology (GO) IDs for each gene were obtained from TrEMBL, InterPro, and EggNOG.

Data Records

The raw data of genome and transcriptome sequencing were deposited in the Genome Sequence Archive⁴⁰ (GSA, https://ngdc.cncb.ac.cn/gsa) at the National Genomics Data Center (NGDC) under the accession number CRA014119⁴¹. The genome sequence was stored in the Genome Warehouse⁴² (GWH, https://ngdc.cncb.ac.cn/gwh) with the accession number GWHERBG00000000. All data are linked to BioProject PRJCA022271. The genome sequence was also deposited in the National Center for Biotechnology Information (NCBI) GenBank, with accession number GCA_039619475.1⁴³, under BioProject PRJNA1056491. The genome annotations are available from the Figshare repository⁴⁴.

Technical Validation

High mapping quality were confirmed in our results. After assembling PacBio reads, Bwa¹² was used to align Illumina short-reads with the assembled genome, resulted in mapping rate of 99.36%. The coverage of the genome by Illumina short-reads is 99.86% with an average depth of 51. Minimap2⁴⁵ was used to align HiFi reads with the assembled genome, resulted in mapping rate of 98.90%, and achieving a genome coverage of 99.98% with an average depth of 31. A genome preprint on I. purchasi demonstrated similar genome size and composition, with a genome size of 1,098.4 Mb and a GC content of 33.4%⁴⁶.

Code availability

All software used in this study is in the public domain and its parameters are clearly described in the methodology. If the detailed parameters of the software are not mentioned, the default parameters are used as recommended by the developer. No custom code or scripts were used for the curation and validation of the dataset.

References

García Morales, M. et al. ScaleNet: a literature-based model of scale insect biology and systematics. DATABASE-OXFORD 2016, bav118 (2016).
PubMed PubMed Central Google Scholar
Nur, U. Evolution of unusual chromosome systems in scale insects (Coccoidea: Homoptera). Insect Cytogenetics, 97–117 (1980).
Ross, L., Pen, I. & Shuker, D. M. Genomic conflict in scale insects: the causes and consequences of bizarre genetic systems. Biol. Rev. 85, 807–828 (2010).
Article PubMed Google Scholar
Normark, B. B. The evolution of alternative genetic systems in insects. Annu. Rev. Entomol. 48, 397–423 (2003).
Article CAS PubMed Google Scholar
Royer, M. Intersexuality in the Animal Kingdom 135–145 (Springer, 1975).
Mongue, A. J. et al. Sex, males, and hermaphrodites in the scale insect Icerya purchasi. Evolution 75, 2972–2983 (2021).
Article PubMed Google Scholar
Li, M. et al. A chromosome-level genome assembly provides new insights into paternal genome elimination in the cotton mealybug Phenacoccus solenopsis. Mol. Ecol. Resour. 20, 1733–1747 (2020).
Article CAS PubMed Google Scholar
Yang, P. et al. Genome sequence of the Chinese white wax scale insect Ericerus pela: The first draft genome for the Coccidae family of scale insects. GigaScience 8, giz113 (2019).
Article PubMed PubMed Central Google Scholar
Doyle, J. J. & Doyle, J. L. A rapid DNA isolation procedure for small quantities of fresh leaf tissue. Phytochemical Bulletin 19, 11–15 (1987).
Google Scholar
Rio, D. C., Ares, M., Hannon, G. J. & Nilsen, T. W. Purification of RNA using TRIzol (TRI reagent). Cold Spring Harbor Protocols 2010, pdb. prot5439 (2010).
Article PubMed Google Scholar
Marçais, G. & Kingsford, C. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics 27, 764–770 (2011).
Article PubMed PubMed Central Google Scholar
Vurture, G. W. et al. GenomeScope: fast reference-free genome profiling from short reads. Bioinformatics 33, 2202–2204 (2017).
Article CAS PubMed PubMed Central Google Scholar
Cheng, H., Concepcion, G. T., Feng, X., Zhang, H. & Li, H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat. Methods 18, 170–175 (2021).
Article CAS PubMed PubMed Central Google Scholar
Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics 25, 1754–1760 (2009).
Article CAS PubMed PubMed Central Google Scholar
Burton, J. N. et al. Chromosome-scale scaffolding of de novo genome assemblies based on chromatin interactions. Nat. Biotechnol. 31, 1119–1125 (2013).
Article CAS PubMed PubMed Central Google Scholar
Wickham, H. ggplot2: Elegant Graphics for Data Analysis. (Springer, 2016).
Challis, R., Richards, E., Rajan, J., Cochrane, G. & Blaxter, M. BlobToolKit–interactive quality assessment of genome assemblies. G3- Genes, Genomes, Genet. 10, 1361–1374 (2020).
Article CAS Google Scholar
Flynn, J. M. et al. RepeatModeler2 for automated genomic discovery of transposable element families. Proc. Natl. Acad. Sci. USA 117, 9451–9457 (2020).
Article ADS CAS PubMed PubMed Central Google Scholar
Bao, Z. & Eddy, S. R. Automated de novo identification of repeat sequence families in sequenced genomes. Genome Res. 12, 1269–1276 (2002).
Article CAS PubMed PubMed Central Google Scholar
Price, A. L., Jones, N. C. & Pevzner, P. A. De novo identification of repeat families in large genomes. Bioinformatics 21, i351–i358 (2005).
Article CAS PubMed Google Scholar
Wheeler, T. J. et al. Dfam: a database of repetitive DNA based on profile hidden Markov models. Nucleic Acids Res. 41, D70–D82 (2012).
Article PubMed PubMed Central Google Scholar
Ou, S. & Jiang, N. LTR_retriever: a highly accurate and sensitive program for identification of long terminal repeat retrotransposons. Plant Physiol. 176, 1410–1422 (2018).
Article CAS PubMed Google Scholar
Ellinghaus, D., Kurtz, S. & Willhoeft, U. LTRharvest, an efficient and flexible software for de novo detection of LTR retrotransposons. BMC Bioinformatics 9, 1–14 (2008).
Article Google Scholar
Xu, Z. & Wang, H. LTR_FINDER: an efficient tool for the prediction of full-length LTR retrotransposons. Nucleic Acids Res. 35, W265–W268 (2007).
Article PubMed PubMed Central Google Scholar
Chen, N. Using Repeat Masker to identify repetitive elements in genomic sequences. Curr. Protoc. Bioinformatics 5, 4.10. 11–14.10. 14 (2004).
Article Google Scholar
Beier, S., Thiel, T., Münch, T., Scholz, U. & Mascher, M. MISA-web: a web server for microsatellite prediction. Bioinformatics 33, 2583–2585 (2017).
Article CAS PubMed PubMed Central Google Scholar
Benson, G. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res. 27, 573–580 (1999).
Article CAS PubMed PubMed Central Google Scholar
Stanke, M., Diekhans, M., Baertsch, R. & Haussler, D. Using native and syntenically mapped cDNA alignments to improve de novo gene finding. Bioinformatics 24, 637–644 (2008).
Article CAS PubMed Google Scholar
Korf, I. Gene finding in novel genomes. BMC Bioinformatics 5, 1–9 (2004).
Article Google Scholar
Keilwagen, J. et al. Using intron position conservation for homology-based gene prediction. Nucleic Acids Res. 44, e89 (2016).
Article PubMed PubMed Central Google Scholar
Kim, D., Langmead, B. & Salzberg, S. L. HISAT: a fast spliced aligner with low memory requirements. Nat. Methods 12, 357–360 (2015).
Article CAS PubMed PubMed Central Google Scholar
Pertea, M. et al. StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nat. Biotechnol. 33, 290–295 (2015).
Article CAS PubMed PubMed Central Google Scholar
Tang, S., Lomsadze, A. & Borodovsky, M. Identification of protein coding regions in RNA transcripts. Nucleic Acids Res. 43, e78 (2015).
Article PubMed PubMed Central Google Scholar
Haas, B. J. et al. Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Res. 31, 5654–5666 (2003).
Article CAS PubMed PubMed Central Google Scholar
Grabherr, M. G. et al. Trinity: reconstructing a full-length transcriptome without a genome from RNA-Seq data. Nat. Biotechnol. 29, 644 (2011).
Article CAS PubMed PubMed Central Google Scholar
Haas, B. J. et al. Automated eukaryotic gene structure annotation using EVidenceModeler and the Program to Assemble Spliced Alignments. Genome Biol. 9, 1–22 (2008).
Article Google Scholar
Simão, F. A., Waterhouse, R. M., Ioannidis, P., Kriventseva, E. V. & Zdobnov, E. M. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics 31, 3210–3212 (2015).
Article PubMed Google Scholar
Buchfink, B., Xie, C. & Huson, D. H. Fast and sensitive protein alignment using DIAMOND. Nat. Methods 12, 59–60 (2015).
Article CAS PubMed Google Scholar
Jones, P. et al. InterProScan 5: genome-scale protein function classification. Bioinformatics 30, 1236–1240 (2014).
Article CAS PubMed PubMed Central Google Scholar
Chen, T. et al. The genome sequence archive family: toward explosive data growth and diverse data types. Genomics, Proteomics &. Bioinformatics 19, 578–583 (2021).
Google Scholar
NGDC Genome Sequence Archive (GSA) https://ngdc.cncb.ac.cn/gsa/browse/CRA014119 (2024).
Chen, M. et al. Genome warehouse: a public repository housing genome-scale data. Genomics, Proteomics & Bioinformatics 19, 584–589 (2021).
Article Google Scholar
NCBI Assembly https://identifiers.org/ncbi/insdc.gca:GCA_039619475.1 (2024).
Zhang, L. Chromosome genome annotation information of Icerya purchasi. figshare https://doi.org/10.6084/m9.figshare.24958746.v1 (2024).
Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).
Article CAS PubMed PubMed Central Google Scholar
Mongue, A. J., Ross, L., Watson, G. W. & Darwin Tree of Life Consortium. The genome sequence of the cottony cushion scale, Icerya purchasi (Maskell, 1879). Wellcome Open Res. 9, 21, https://doi.org/10.12688/wellcomeopenres.20653.1 (2024).

Download references

Acknowledgements

This research was supported by National Natural Science Foundation of China (Grant No. 32270499), the Special Investigation Program for National Science and Technology Basic Resources (2022FY100500) and the Fujian Agriculture and Forestry University Science Fund for Distinguished Young Scholars (Kxjq21004).

Author information

Authors and Affiliations

State Key Laboratory of Ecological Pest Control for Fujian and Taiwan Crops, College of Plant Protection, Fujian Agriculture and Forestry University, Fuzhou, 350002, China
Jun Deng, Lin Zhang, Hui Zhang & Xiaolei Huang
Key Laboratory of Forest Disaster Warning and Control in Yunnan Province, College of Biodiversity Conservation, Southwest Forestry University, Kunming, 650224, China
Xubo Wang
Yunnan Academy of Biodiversity, Southwest Forestry University, Kunming, 650224, China
Xubo Wang

Authors

Jun Deng
View author publications
You can also search for this author in PubMed Google Scholar
Lin Zhang
View author publications
You can also search for this author in PubMed Google Scholar
Hui Zhang
View author publications
You can also search for this author in PubMed Google Scholar
Xubo Wang
View author publications
You can also search for this author in PubMed Google Scholar
Xiaolei Huang
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

X.L.H. and J.D. designed and funded the study, X.B.W. collected and prepared the samples, L.Z. and H.Z. performed the bioinformatic analyses. J.D. and L.Z. drafted the manuscript. X.L.H. and J.D. revised the manuscript. All authors approved the submitted version.

Corresponding author

Correspondence to Xiaolei Huang.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Deng, J., Zhang, L., Zhang, H. et al. Chromosome-level genome assembly of the cottony cushion scale Icerya purchasi. Sci Data 11, 639 (2024). https://doi.org/10.1038/s41597-024-03502-x

Download citation

Received: 23 February 2024
Accepted: 10 June 2024
Published: 17 June 2024
DOI: https://doi.org/10.1038/s41597-024-03502-x
Springer Nature Limited

Chromosome-level genome assembly of the cottony cushion scale Icerya purchasi

Abstract

Similar content being viewed by others

A chromosome-level genome assembly of the forestry pest Coronaproctus castanopsis

Chromosome level genome assembly of oriental armyworm Mythimna separata

A chromosome-level genome assembly of Sesamia inferens

Background & Summary