Chromosome-level assembly of the synthetic hexaploid wheat-derived cultivar Chuanmai 104

Liu, Zehou; Yang, Fan; Deng, Cao; Wan, Hongshen; Tang, Hao; Feng, Junyan; Wang, Qin; Yang, Ning; Li, Jun; Yang, Wuyun

doi:10.1038/s41597-024-03527-2

Chromosome-level assembly of the synthetic hexaploid wheat-derived cultivar Chuanmai 104

Data Descriptor
Open access
Published: 22 June 2024

Volume 11, article number 670, (2024)
Cite this article

Download PDF

You have full access to this open access article

Scientific Data

Chromosome-level assembly of the synthetic hexaploid wheat-derived cultivar Chuanmai 104

Download PDF

Zehou Liu ORCID: orcid.org/0000-0002-9960-8336^1,2,3,4^na1,
Fan Yang^4,5^na1,
Cao Deng ORCID: orcid.org/0000-0001-9823-9121^6,7^na1,
Hongshen Wan^1,2,3,4,
Hao Tang^1,2,3,4,
Junyan Feng^4,5,
Qin Wang^1,2,3,4,
Ning Yang^1,2,3,4,
Jun Li^1,2,3,4^na2 &
…
Wuyun Yang ORCID: orcid.org/0000-0002-8201-4929^1,2,3,4^na2

1126 Accesses
Explore all metrics

Abstract

Synthetic hexaploid wheats (SHWs) are effective genetic resources for transferring agronomically important genes from wild relatives to common wheat (Triticum aestivum L.). Dozens of reference-quality pseudomolecule assemblies of hexaploid wheat have been generated, but none is reported for SHW-derived cultivars. Here, we generated a chromosome-scale assembly for the SHW-derived cultivar ‘Chuanmai 104’ based on PacBio HiFi reads and chromosome conformation capture sequencing. The total assembly size was 14.81 Gb with a contig N50 length of 58.25 Mb. A BUSCO analysis yielded a completeness score of 99.30%. In total, repetitive elements comprised 81.36% of the genome and 122,554 high-confidence protein-coding gene models were predicted. In summary, the first chromosome-level assembly for a SHW-derived cultivar presents a promising outlook for the study and utilization of SHWs in wheat improvement, which is essential to meet the global food demand.

Chromosome-scale assembly of the wild wheat relative Aegilops umbellulata

Article Open access 25 October 2023

Chromosome-specific sequencing reveals an extensive dispensable genome component in wheat

Article Open access 08 November 2016

Chromosome-scale genome assembly of bread wheat’s wild relative Triticum timopheevii

Article Open access 23 April 2024

Background & Summary

Common wheat (Triticum aestivum L., 2n = 6x = 42, AABBDD) is a natural amphiploid derived from the intergeneric cross between T. turgidum subsp. durum (Desf.) Husn., a cultivated allotetraploid (2n = 4x = 28, AABB), and Aegilops tauschii Coss., a diploid goat grass (2n = 2x = 14, DD). The genetic diversity among common wheat cultivars has been drastically reduced owing to bottlenecks resulting from polyploidy, domestication, and modern plant breeding. The decline in genetic diversity can be counteracted by direct hybridization between common wheat and A. tauschii^1,2, or through hybridization between common wheat and synthetic hexaploid wheats (SHWs) developed by crossing tetraploid wheat and A. tauschii³.

The primary objective of the direct hybridization method is to augment genetic diversity specifically for the D genome in common wheat, addressing a crucial concern in wheat breeding, because significantly lower genetic diversity values characterize this genome compared with the A and B genomes⁴. However, the diminished genetic diversity resulting from the bottlenecks also affects the A and B genomes. Consequently, the utilization of SHW lines enables the diversity of all three subgenomes of common wheat to be enhanced. This approach facilitates the direct transfer of genes/loci for traits of interest from diploid and tetraploid to hexaploid wheat.

To date, the International Maize and Wheat Improvement Centre (CIMMYT) has developed more than 1200 SHW lines³. Since the introduction of more than 200 SHW accessions from CIMMYT in 1995, four SHW-derived cultivars, namely, Chuanmai 38, 42, 43, and 47, have been raised and cultivated, which have been widely used in wheat breeding as elite parents in China. Subsequently, a number of secondary SHW-derived cultivars have been developed and released, including Chuanmai 104, developed from the cross of Chuanmai 42 and Chuannong 16. Chuanmai 104 is an important high-yielding wheat cultivar grown in Southwest China in recent years. The maximum yield of Chuanmai 104 attains 10,947 kg/ha under the humid and predominantly cloudy climate of the Sichuan Basin in Southwest China⁵. Chuanmai 104 is becoming a cornerstone breeding parent of wheat in China. Furthermore, China is among the main countries that are exploiting the advantages of SHW lines as genetic resources, especially in Southwest China³. The increasing utilization of SHW worldwide is indicative of the success of such an approach, which will gradually become an effective means of overcoming the bottleneck of wheat breeding.

Considering previous studies based on SHWs, a major potential limiting factor is the limited genetic resources and lack of reference-quality pseudomolecule assemblies (RQAs)⁶. Chapman et al. integrated whole-genome sequencing and genetic mapping to assemble and ordered contigs of the SHW cultivar W7984⁷. However, given the short reads generated by next-generation sequencing (NGS), and the lack of chromosome conformation capture sequencing or chromosome isolation via flow sorting, the assembly was only 9.1 Gb, which was substantially less than the estimated 15 Gb size of the hexaploid wheat genome⁶. Although single-nucleotide polymorphism (SNP) genotyping arrays are relatively simple and inexpensive, a limitation is that only the variants pre-selected for inclusion on the array can be analyzed. Consequently, if the SNP panels were designed using common wheat genome assemblies, they would lack sufficient representation of variants in the target gene pools, and thus assessment of useful variation in SHW and derivative germplasm would be challenging. More recently, reduction in costs have meant that RQAs and large-scale whole-genome resequencing are feasible and affordable for SHWs.

In the current study, we first generated a chromosome-level assembly for Chuanmai 104 (Fig. 1), based on an integrated approach including PacBio HiFi sequencing reads and chromosome conformation capture sequencing. The final Chuanmai 104 genome assembly consisted of 14.81 Gb with a contig N50 of 58.25 Mb, a contig N90 of 8.41 Mb, and a longest contig of 422.27 Mb (Table 1). Among previously published hexaploidy wheat assemblies, seven of the 21 chromosomes in the Chuanmai 104 were the longest (Table 2). The long terminal repeat (LTR) Assembly Index (LAI)⁸ of the Chuanmai 104 genome assembly was 15.17, 14.64, and 10.85 for A subgenome, B subgenome, D subgenome respectively, and for each chromosome, the LAI values ranges from 10.21 to 15.71 (Table 3). Benchmarking universal single-copy orthologs (BUSCO) analysis yielded a completeness score of 99.30%, which was comparable with that of common wheat genomes and notably higher than that of the SHW cultivar W7984 (Table 1). Repeats comprised 81.36% of the sequences with a predominance of retrotransposons, which accounted for 62.96% of the sequences (Table 4). In total, 122,554 high-confidence and 136,431 low-confidence protein-coding gene models were predicted (Table 5); this number was similar to that for the common wheat Chinese Spring (Table 1). The high-quality Chuanmai 104 genome assembly generated in this study provides a reference genome for SHW-derived cultivars, and offers a promising outlook for the study and utilization of SHW genetic resources in wheat improvement, which is essential to meet the global food demand.

Table 1 The summary results of the SHWs (CM104 and W7984) and common wheats (Fielder, Kariega, Attraktion, Renan, CS) genome assemblies.

Full size table

Table 2 The chromosomes lengths of selected representative hexaploid wheat genomes.

Full size table

Table 3 Statistics of number of contigs, LAI and non-coding RNAs on each chromosome in Chuanmai 104.

Full size table

Table 4 The statistics for the repeats in the Chuanmai 104 genome.

Full size table

Table 5 Statistics of gene structural and functional annotation.

Full size table

Methods

Plant material, DNA extraction, and sequencing

The SHW-derived cultivar Chuanmai 104 was kindly provided by Wuyun Yang (Crop Research Institute, Sichuan Academy of Agricultural Sciences). The plants used for sequencing were grown in a growth chamber with a controlled environment of 20 degree Celsius under a 12 h light/12 h dark photoperiod for 2 weeks. Genomic DNA (gDNA) was extracted from seedling leaf tissues using the cetyltrimethylammonium bromide method. Three methods were applied for DNA quantification and quality testing, including (i) NanoDrop 2000 spectrophotometer (Thermo Fischer Scientific), (ii) gel electrophoresis and (iii) Qubit fluorometer (Invitrogen). Total DNA was purified by AMPure PB beads (Pacific Biosciences, CA, USA; PN 100-265-900). High-quality gDNA (≥10 μg, ≥100 ng/μl) was prepared for the next step of library construction. PacBio single-molecule real-time (SMRT) bell library preparation was performed using the SMRTbell® Express Template Prep Kit 2.0 (Pacific Biosciences, CA, USA; PN 101-853-100) in accordance with the manufacturer’s instructions. The library was prepared for sequencing with a 30 h movie on the Sequel IIe system (Pacific Biosciences) by the Berry Genomics Corporation (Beijing, China). Totally, we generated 668.43 Gb bases (~45X) with 40,999,150 CCS reads from 20 SMRT cells.

Chromosome conformation capture (Hi-C) sequencing of Chuanmai 104 was performed using the protocol of Peng et al.⁹. In brief, 2–4 g tender leaves from the plants used for genome sequencing were harvested and stored in liquid nitrogen, and then the Hi-C libraries were prepared and sequenced on the MGISEQ-2000 platform by BGI (Wuhan, China). Samples were cut into pieces of ca. 2 cm², and transferred to 50 ml tubes containing 15 ml of ice-cold nuclear isolation buffer (NBE) with 2% formaldehyde, followed by vacuum infiltration (400 mbar) and incubation with a supplemented cross-linking agent for 1 h. Cross-linking was quenched by adding 2 M glycine to a final concentration of 0.125 M with incubation for 5 min under vacuum, followed by fixation on ice. Then, the fixed leaf pieces were washed three times with sterile Milli-Q water, ground in liquid nitrogen and subjected to nucleus isolation. The isolated nuclei were purified, checked for quality and quantity and digested with 100 units of DpnII. The next steps were Hi-C specific, including marking the DNA ends with biotin-14-dATP and performing blunt-end ligation of the cross-linked fragments. After ligation, cross-linking was reversed by overnight incubation with proteinase K at 65 °C. Biotin-14-dATP was further removed from non-ligated DNA ends using the exonuclease activity of T4 DNA polymerase. DNA was purified by phenol:chloroform (1:1) extraction, precipitated and washed as previously described. The purified DNA was physically sheared to a size of 300–600 bp by sonication and was size-fractionated using standard 2% agarose gel electrophoresis to obtain fragments in the range of 300–600 bp. The fragmented ends were blunt-end repaired and A-tailed, followed by purification through biotin-streptavidin-mediated pulldown. PCR amplification was conducted using 12–15 cycles to enrich the ligation products. Totally, we generated more than 2 Tb bases (>135 X) with 6.69 Gb read pairs.

For full-length transcriptome sequencing, we collected pooled sample for Chuanmai 104, which comprised whole plant organs except for roots from seed germination to the three-leaf stage, shoots at the seedling stage, and leaves, stems, ears, and seeds from the heading to the late-filling stages. Total RNA was isolated using TRIzol Reagent in accordance with the manufacturer’s instructions (Thermofisher). The RNA purity and raw contamination were first assed by Nanodrop 2000 (Thermo Fischer Scientific), and then the RNA Integrity Number (RIN) and concentration were further assessed by an Agilent 4200 (Agilent Technologies). High-quality RNA (2 μg, 300 ng/μl) was prepared for the next step of library construction. PacBio SMRT bell library preparation was performed using the SMRTbell® Express Template Prep Kit 2.0 (Pacific Biosciences) in accordance with the manufacturer’s instructions. The library was prepared for sequencing with a 30 h movie on the Sequel IIe system (Pacific Biosciences) by the Berry Genomics Corporation. Totally, we generated 186.35 Gb bases with 2,283,790 polymerase reads from one SMRT cell. The final 46,130,981 subreads range from 51 bp to 241,082 bp, with a mean and N50 value of 4,039.55 bp and 4,561 bp respectively.

Genome assembly

The PacBio HiFi CCS reads were assembled using hifiasm¹⁰ (v0.16.1, with default parameters). The Hi-C reads were incorporated using Juicer tools¹¹ (v1.6) and EndHiC¹². In brief, preprocessing of the Hi-C reads was performed with juicer.sh¹¹ (parameter: -s DpnII). The output file corresponding to the Hi-C contacts with duplicates removed and mapping quality values larger than 30 was generated as input for EndHiC¹². These result files were plotted to visualize the Hi-C map and for manual curation, and were used to generate the final assembly (21 pseudomolecules and one unanchored pseudomolecule). The NCBI Foreign Contamination Screen (FCS)¹³ was used to identify and remove contaminant sequences (adaptors and organelles) in genome assemblies. Totally, the FCS identified total 754 contaminant fragments, including one adaptor fragment and 753 mitochondrial fragments, and all these contaminants are located on the unanchored pseudomolecule and were masked.

Validation of genome assemblies

Genome sizes were estimated using three algorithms (gce¹⁴, GenomeScope2¹⁵, and findGSE¹⁶) with different k-mer sizes. The quality and completeness of the genome assemblies were assessed by merqury¹⁷, which uses a reference-free, k-mer-based approach, and BUSCO¹⁸ (v5, poales_odb10), which is based on evolutionarily informed expectations of the near-universal single-copy orthologous gene content. LTR assembly index (LAI)⁸ that evaluates assembly continuity using LTR-RTs were calculated.

Subgenome assignment, validation, and nomenclature

To assign each chromosome to each linkage group and apply the corresponding nomenclature in Chinese Spring, SubPhaser¹⁹, a robust allopolyploid subgenome phasing method based on subgenome-specific k-mers, was used. To validate the correctness of the subgenome assignment, a reference-guided strategy based on subgenome homology was also used to distinguish the subgenomes. We mapped the Chuanmai 104 genome to the Chinese Spring genome using mashmap²⁰ (-f map–perc_identity 90 -s 1000000). Then, the alignments were plotted and manually checked. This procedure successfully categorized the 21 chromosomes into three homologous groups. The nomenclature system for Chinese Spring chromosomes was adopted for naming of the homologous groups (1–7) of the Chuanmai 104 genome.

Repeat annotation

Tandem repeats of all lengths were annotated with TandemRepeatsFinder²¹ v4.09 using the default parameters (Match Mismatch Delta PM PI Minscore MaxPeriod: 2 7 7 80 10 50 500). LTR_FINDER²² (v1.05) and LTR_harvest²³ (v1.5.10) were used for long terminal repeat (LTR) identification, and the results were processed with LTR_retriever²⁴ (v2.8) to generate a species-specific LTR library. The species-specific LTR libraries, wheat transposable element (TE) sequences from ClariTeRep (https://github.com/jdaron/CLARI-TE), and plant TE sequences from Repbase²⁵ were merged to generate the TE library. Transposons were detected and classified by a homology search against the combined TE library. The program Vmatch (http://www.vmatch.de/), a fast and efficient matching tool suitable for large and highly repetitive genomes, was used for this computationally intensive task with the following parameters: identity ≥ 70%, minimal hit length 75 bp, and seed length 12 bp (exact command line: -d -p -l 75 -identity 70 -seedlength 12 -exdrop 5).

Non-coding gene annotation

Noncoding RNAs (ncRNAs), including miRNAs, small nuclear RNAs, rRNAs and regulatory elements, were identified using the Infernal²⁶ (version 1.1.2) program to search against the Rfam²⁷ database (v14.8). The rRNAs, and tRNAs were further identified using RNAmmer²⁸ (version 1.2) and tRNAscan-SE²⁹ (v1.3.1) respectively.

Protein-coding gene annotation

Gene model prediction was performed following the method described by Mascher et al.³⁰, with minor modifications, which integrated transcriptomic data, protein homology, and ab initio prediction. (1) First, isoform sequencing (Iso-Seq) data were mapped to the genome using minimap2³¹ (v2.17-r941; parameters: -ax splice -uf –secondary = no -C5). The redundant isoforms were further collapsed into transcript loci using cDNA_Cupcake (https://github.com/Magdoll/cDNA_Cupcake) (parameter: –dun-merge-5-shorter). TransDecoder (v5.5.0, https://github.com/TransDecoder/TransDecoder) was used to predict protein sequences among the transcripts. (2) For protein homology evidence, we projected the gene structures from Triticeae species, comprising Ae. tauschii, T. turgidum subsp. dicoccoides, T. turgidum subsp. durum, T. aestivum Chinese Spring, T. urartu, Ae. speltoides, and Hordeum vulgare Morex, onto the Chuanmai 104 genome using liftoff³² with default parameters. (3) We produced ab initio gene predictions using AUGUSTUS³³ (v3.4.0), GeneMark-ET³⁴ (v4.38), and GeneID³⁵ (v1.4). In brief, AUGUSTUS³³ gene prediction was performed using a model specifically trained from the software and a hints file generated using the previously mentioned Iso-Seq predictions. GeneMark-ET³⁴ was used with the option -ET, and the intron coordinates were calculated using the above-mentioned Iso-Seq alignments. GeneID³⁵ was run with a model specifically trained from the software (-GP taestivum.param). We used EVidenceModeler³⁶ (EVM; v1.1.1) to integrate all of the gene evidence from transcriptomics, protein alignments, and ab initio predictions.

Protein-coding gene models from EVM were classified as high-confidence or low-confidence according to criteria used by the International Wheat Genome Sequencing Consortium, with minor modifications³⁷. In brief, protein-coding gene models were considered as ‘complete’ when start and stop codons were present. A comparison with PTREP³⁸ (the database of hypothetical proteins deduced from the nonredundant database of TEs within the TREP database), UniPoa³⁹ (Poaceae database of annotated proteins from the UniProt database), and UniMag⁴⁰ (validated Magnoliophyta proteins from SwissProt) was performed using DIAMOND⁴¹ (v2.0.9; parameters: -e 1e-10 –query-cover 80–subject-cover 80). Gene candidates were classified using the following criteria: a high-confidence gene model was ‘complete’ with a hit in the UniMag⁴⁰ database and/or in UniPoa³⁹ but not PTREP³⁸; the remaining gene models were classified as low-confidence genes.

Functional assignments of the predicted protein-coding genes were obtained with BLAST⁴² by aligning the coding regions to sequences in public protein databases, including the trEMBL⁴⁰, RefSeq⁴³., and SwissProt⁴⁰ databases. The putative domains and GO⁴⁴ terms of the predicted proteins were identified using the InterProScan⁴⁵ program. The putative orthologs in the KEGG⁴⁶ database were identified using KoFamScan⁴⁷.

Data Records

The HiFi reads, Iso-seq reads, and Hi-C reads that were used for the Chuanmai 104 genome assembly have been deposited in the NCBI Sequence Read Archive with accession number SRP488123 and under BioProject number PRJNA1070409⁴⁸. The HiFi reads, Iso-seq reads, and Hi-C reads were also deposited in the National Genomics Data Centre (NGDC) with BioProject ID PRJCA022052 (https://ngdc.cncb.ac.cn/bioproject/browse/PRJCA022052). The genome assembly has been deposited at GenBank under the accession JBBIFV000000000⁴⁹. The genome assemblies and annotations have also been deposited at FigShare⁵⁰ with doi number https://doi.org/10.6084/m9.figshare.25282654.

Technical Validation

The assembled genome size is similar to the size estimated by different algorithms^14,15,16 (Fig. 2a–c), and is significantly larger than that published previously for the SHW cultivar W7984 (Table 1). The base-level accuracy QV (consensus quality value) and k-mer completeness scores evaluated with merqury¹⁷ are 65.86 and 97.59%, respectively. The long terminal repeat (LTR) Assembly Index (LAI)⁸ of the Chuanmai 104 genome assembly was 15.17, 14.64, and 10.85 for A subgenome, B subgenome, D subgenome respectively, which are higher than the LAI values obtained for Chinese Spring (11.88, 12.51 and 9.97 for A subgenome, B subgenome, D subgenome respectively). The BUSCO¹⁸ score is 99.3% and only 0.7% BUSCO genes are missing (Fig. 2d). These results indicate a high completeness of the Chuanmai 104 assembly. Comparison with other common wheat genome assemblies revealed that the Chuanmai 104 NG50 value was significantly larger, implying high connectivity (Fig. 2e). The GC-depth plot (Fig. 2f) of the Chuanmai 104 genome across every 2 kb nonoverlapping sliding window showed no distinct secondary peaks, indicating that haplotype homology was adequately recognized during assembly.

The Hi-C contact map was manually curated and assessed with Juicebox and revealed a dense pattern along the diagonal, indicating no potential mis-assemblies (Fig. 3). The anti-diagonals are typical for Triticeae genomes⁵¹ (Fig. 3). The distribution of the A. tauschii subtelomeric tandem repeat sequences (NCBI GenBank accessions: AY249980.1, AY249981.1, and AY249982.1) and T. monococcum subsp. aegilopoides centromere-specific tandem repeat sequences (NCBI GenBank accessions: DQ904440.1 and EF624064.1) indicate the completeness in these complex regions (Fig. 1a–c).

Using SubPhaser¹⁹, a robust allopolyploid subgenome phasing method based on subgenome-specific k-mers, the 21 chromosomes of the Chuanmai 104 genome were aggregated into three linkage groups (Fig. 1h–k). These groups show high synteny to chromosomes of Chinese Spring at both the nucleotide and protein levels (Fig. 4), indicating the correctness of the chromosome assembly. Moreover, these synteny results show the relative conservation of the common wheat and SHW genomes, although the sources of the subgenomes and their evolutionary history differ.

Code availability

All software and pipelines were executed according to the manual and protocol of published tools. No custom code was generated for these analyses.

References

Gill, B. S. et al. Wheat Genetics Resource Center: The First 25 Years. in Advances in Agronomy vol. 89 73–136 (Academic Press, 2006).
Gill, B. S. & Raupp, W. J. Direct Genetic Transfers from Aegilops squarrosa L. to Hexaploid Wheat1. Crop Science 27, cropsci1987.0011183X002700030004x (1987).
Li, A., Liu, D., Yang, W., Kishii, M. & Mao, L. Synthetic Hexaploid Wheat: Yesterday, Today, and Tomorrow. Engineering 4, 552–558 (2018).
Article CAS Google Scholar
Mazzucotelli, E. et al. Gene Flow Between Tetraploid and Hexaploid Wheat for Breeding Innovation. in The Wheat Genome (eds. Appels, R., Eversole, K., Feuillet, C. & Gallagher, D.) 135–163. https://doi.org/10.1007/978-3-031-38294-9_8 (Springer International Publishing, Cham, 2024).
Li, J., Wan, H.-S. & Yang, W.-Y. Synthetic hexaploid wheat enhances variation and adaptive evolution of bread wheat in breeding processes. Journal of Systematics and Evolution 52, 735–742 (2014).
Article Google Scholar
Walkowiak, S. et al. Multiple wheat genomes reveal global variation in modern breeding. Nature 588, 277–283 (2020).
Article CAS PubMed PubMed Central ADS Google Scholar
Chapman, J. A. et al. A whole-genome shotgun approach for assembling and anchoring the hexaploid bread wheat genome. Genome Biology 16, 26 (2015).
Article PubMed PubMed Central Google Scholar
Ou, S., Chen, J. & Jiang, N. Assessing genome assembly quality using the LTR Assembly Index (LAI). Nucleic Acids Res 46, e126 (2018).
PubMed PubMed Central Google Scholar
Peng, Y. et al. Reference genome assemblies reveal the origin and evolution of allohexaploid oat. Nat Genet 54, 1248–1258 (2022).
Article CAS PubMed PubMed Central Google Scholar
Cheng, H., Concepcion, G. T., Feng, X., Zhang, H. & Li, H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat Methods 18, 170–175 (2021).
Article CAS PubMed PubMed Central Google Scholar
Durand, N. C. et al. Juicer provides a one-click system for analyzing loop-resolution Hi-C experiments. Cell Syst 3, 95–98 (2016).
Article CAS PubMed PubMed Central Google Scholar
Wang, S. et al. EndHiC: assemble large contigs into chromosome-level scaffolds using the Hi-C links from contig ends. BMC Bioinformatics 23, 528 (2022).
Article CAS PubMed PubMed Central Google Scholar
Astashyn, A. et al. Rapid and sensitive detection of genome contamination at scale with FCS-GX. Genome Biology 25, 60 (2024).
Article CAS PubMed PubMed Central Google Scholar
Liu, B. et al. Estimation of genomic characteristics by analyzing k-mer frequency in de novo genome projects. Preprint at https://www.semanticscholar.org/paper/Estimation-of-genomic-characteristics-by-analyzing-Liu-Shi/716199abb13c0cab875f3dfe6302cce857685385 (2013).
Ranallo-Benavidez, T. R., Jaron, K. S. & Schatz, M. C. GenomeScope 2.0 and Smudgeplot for reference-free profiling of polyploid genomes. Nat Commun 11, 1432 (2020).
Article CAS PubMed PubMed Central ADS Google Scholar
Sun, H., Ding, J., Piednoël, M. & Schneeberger, K. findGSE: estimating genome size variation within human and Arabidopsis using k-mer frequencies. Bioinformatics 34, 550–557 (2018).
Article CAS PubMed Google Scholar
Rhie, A., Walenz, B. P., Koren, S. & Phillippy, A. M. Merqury: reference-free quality, completeness, and phasing assessment for genome assemblies. Genome Biology 21, 245 (2020).
Article CAS PubMed PubMed Central Google Scholar
Simão, F. A., Waterhouse, R. M., Ioannidis, P., Kriventseva, E. V. & Zdobnov, E. M. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics 31, 3210–3212 (2015).
Article PubMed Google Scholar
Jia, K.-H. et al. SubPhaser: a robust allopolyploid subgenome phasing method based on subgenome-specific k-mers. New Phytologist 235, 801–809 (2022).
Article CAS PubMed Google Scholar
Jain, C., Koren, S., Dilthey, A., Phillippy, A. M. & Aluru, S. A fast adaptive algorithm for computing whole-genome homology maps. Bioinformatics 34, i748–i756 (2018).
Article CAS PubMed PubMed Central Google Scholar
Benson, G. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res. 27, 573 (1999).
Article CAS PubMed PubMed Central Google Scholar
Xu, Z. & Wang, H. LTR_FINDER: an efficient tool for the prediction of full-length LTR retrotransposons. Nucleic Acids Res. 35, W265–W268 (2007).
Article PubMed PubMed Central Google Scholar
Ellinghaus, D., Kurtz, S. & Willhoeft, U. LTRharvest, an efficient and flexible software for de novo detection of LTR retrotransposons. BMC Bioinformatics 9, 18 (2008).
Article PubMed PubMed Central Google Scholar
Ou, S. & Jiang, N. LTR_retriever: A Highly Accurate and Sensitive Program for Identification of Long Terminal Repeat Retrotransposons. Plant Physiol 176, 1410–1422 (2018).
Article CAS PubMed Google Scholar
Jurka, J. et al. Repbase Update, a database of eukaryotic repetitive elements. Cytogenet. Genome Res. 110, 462–467 (2005).
Article CAS PubMed Google Scholar
Nawrocki, E. P., Kolbe, D. L. & Eddy, S. R. Infernal 1.0: inference of RNA alignments. Bioinformatics 25, 1335–1337 (2009).
Article CAS PubMed PubMed Central Google Scholar
Griffiths-Jones, S. et al. Rfam: annotating non-coding RNAs in complete genomes. Nucleic Acids Res 33, D121–124 (2005).
Article CAS PubMed Google Scholar
Lagesen, K. et al. RNAmmer: consistent and rapid annotation of ribosomal RNA genes. Nucleic Acids Res 35, 3100–3108 (2007).
Article CAS PubMed PubMed Central ADS Google Scholar
Lowe, T. M. & Eddy, S. R. tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res 25, 955–964 (1997).
Article CAS PubMed PubMed Central Google Scholar
Mascher, M. et al. Long-read sequence assembly: a technical evaluation in barley. The Plant Cell 33, 1888–1906 (2021).
Article PubMed PubMed Central Google Scholar
Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).
Article CAS PubMed PubMed Central Google Scholar
Shumate, A. & Salzberg, S. L. Liftoff: accurate mapping of gene annotations. Bioinformatics 37, 1639–1643 (2021).
Article CAS PubMed PubMed Central Google Scholar
Stanke, M. et al. AUGUSTUS: ab initio prediction of alternative transcripts. Nucleic Acids Res. 34, W435–W439 (2006).
Article CAS PubMed PubMed Central Google Scholar
Besemer, J. & Borodovsky, M. GeneMark: web software for gene finding in prokaryotes, eukaryotes and viruses. Nucleic Acids Res 33, W451–454 (2005).
Article CAS PubMed PubMed Central Google Scholar
Blanco, E., Parra, G. & Guigó, R. Using geneid to identify genes. Curr Protoc Bioinformatics Chapter 4, Unit 4.3 (2007).
PubMed Google Scholar
Haas, B. J. et al. Automated eukaryotic gene structure annotation using EVidenceModeler and the Program to Assemble Spliced Alignments. Genome Biology 9, R7 (2008).
Article PubMed PubMed Central Google Scholar
The International Wheat Genome Sequencing Consortium (IWGSC). et al. Shifting the limits in wheat research and breeding using a fully annotated reference genome. Science 361, eaar7191 (2018).
Article Google Scholar
The TREP platform: A curated database of transposable elements. https://trep-db.uzh.ch/.
UniProt. https://www.uniprot.org/.
Boeckmann, B. et al. The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res. 31, 365–370 (2003).
Article CAS PubMed PubMed Central Google Scholar
Buchfink, B., Reuter, K. & Drost, H.-G. Sensitive protein alignments at tree-of-life scale using DIAMOND. Nat Methods 18, 366–368 (2021).
Article CAS PubMed PubMed Central Google Scholar
Camacho, C. et al. BLAST+: architecture and applications. BMC Bioinformatics 10, 421 (2009).
Article PubMed PubMed Central Google Scholar
Pruitt, K. D. et al. RefSeq: an update on mammalian reference sequences. Nucleic Acids Res. 42, D756–D763 (2014).
Article CAS PubMed Google Scholar
Ashburner, M. et al. Gene Ontology: tool for the unification of biology. Nat. Genet. 25, 25–29 (2000).
Article CAS PubMed PubMed Central Google Scholar
Quevillon, E. et al. InterProScan: protein domains identifier. Nucleic Acids Res 33, W116–120 (2005).
Article CAS PubMed PubMed Central Google Scholar
Kanehisa, M., Goto, S., Sato, Y., Furumichi, M. & Tanabe, M. KEGG for integration and interpretation of large-scale molecular data sets. Nucleic Acids Res. 40, D109–D114 (2012).
Article CAS PubMed Google Scholar
Aramaki, T. et al. KofamKOALA: KEGG Ortholog assignment based on profile HMM and adaptive score threshold. Bioinformatics 36, 2251–2252 (2020).
Article CAS PubMed Google Scholar
NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRP488123 (2024).
NCBI Assembly https://identifiers.org/ncbi/insdc.gca:GCA_039655515.1 (2024).
Liu, Z. Genome assembly and annotation of the synthetic hexaploid wheat-derived cultivar Chuanmai 104. figshare https://doi.org/10.6084/m9.figshare.25282654 (2024).
Abrouk, M. et al. Chromosome-scale assembly of the wild wheat relative Aegilops umbellulata. Sci Data 10, 739 (2023).
Article CAS PubMed PubMed Central Google Scholar

Download references

Acknowledgements

This publication is based on work supported by the Accurate Identification Project of Crop Germplasm from Sichuan Provincial Finance Department (1 + 3 ZYGG001), the Science and Technology Department of Sichuan Province (2022NSFSC0601, 2021YFYZ0020, 2023NSFSC1925), the Program of Chinese Agriculture Research System (CARS-03), Strategic Scientist Studio of Sichuan Academy of Agricultural Sciences. We thank Robert McKenzie, PhD, from Liwen Bianji (Edanz) (www.liwenbianji.cn) for editing a draft of this manuscript.

Author information

These authors contributed equally: Zehou Liu, Fan Yang, Cao Deng.
These authors jointly supervised this work: Wuyun Yang, Jun Li.

Authors and Affiliations

Crop Research Institute, Sichuan Academy of Agricultural Sciences, Chengdu, China
Zehou Liu, Hongshen Wan, Hao Tang, Qin Wang, Ning Yang, Jun Li & Wuyun Yang
Environment Friendly Crop Germplasm Innovation and Genetic Improvement Key Laboratory of Sichuan Province, Chengdu, China
Zehou Liu, Hongshen Wan, Hao Tang, Qin Wang, Ning Yang, Jun Li & Wuyun Yang
Key Laboratory of Wheat Biology and Genetic Improvement on Southwestern China, Chengdu, China
Zehou Liu, Hongshen Wan, Hao Tang, Qin Wang, Ning Yang, Jun Li & Wuyun Yang
Key Laboratory of Tianfu Seed Industry Innovation, Chengdu, China
Zehou Liu, Fan Yang, Hongshen Wan, Hao Tang, Junyan Feng, Qin Wang, Ning Yang, Jun Li & Wuyun Yang
Biotechnology and Nuclear Technology Research Institute, Sichuan Academy of Agricultural Sciences, Chengdu, China
Fan Yang & Junyan Feng
The Key Laboratory of Animal Disease and Human Health of Sichuan Province, College of Veterinary Medicine, Sichuan Agricultural University, Chengdu, China
Cao Deng
Departments of Bioinformatics, DNA Stories Bioinformatics Center, Chengdu, China
Cao Deng

Authors

Zehou Liu
View author publications
You can also search for this author in PubMed Google Scholar
Fan Yang
View author publications
You can also search for this author in PubMed Google Scholar
Cao Deng
View author publications
You can also search for this author in PubMed Google Scholar
Hongshen Wan
View author publications
You can also search for this author in PubMed Google Scholar
Hao Tang
View author publications
You can also search for this author in PubMed Google Scholar
Junyan Feng
View author publications
You can also search for this author in PubMed Google Scholar
Qin Wang
View author publications
You can also search for this author in PubMed Google Scholar
Ning Yang
View author publications
You can also search for this author in PubMed Google Scholar
Jun Li
View author publications
You can also search for this author in PubMed Google Scholar
Wuyun Yang
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

W.Y. and J.L. designed the study and provided funding. Z.L., F.Y. and C.D. collected and prepared the plant tissues samples for sequencing. Z.L., F.Y. and C.D. led the bioinformatics analyses and genome analyses. Z.L. and C.D. drafted the manuscript. W. Y., J. L., F.Y., H.W. H.T. J.F., Q.W. N.Y. revised the manuscript. Z.L., F.Y. and C.D. contributed equally. All authors discussed the results and approved the manuscript.

Corresponding authors

Correspondence to Jun Li or Wuyun Yang.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Liu, Z., Yang, F., Deng, C. et al. Chromosome-level assembly of the synthetic hexaploid wheat-derived cultivar Chuanmai 104. Sci Data 11, 670 (2024). https://doi.org/10.1038/s41597-024-03527-2

Download citation

Received: 01 March 2024
Accepted: 14 June 2024
Published: 22 June 2024
DOI: https://doi.org/10.1038/s41597-024-03527-2
Springer Nature Limited

Associated content

Genomics data for plant ecology, conservation and agriculture

Collection 20 January 2023

Chromosome-level assembly of the synthetic hexaploid wheat-derived cultivar Chuanmai 104

Abstract

Similar content being viewed by others

Chromosome-scale assembly of the wild wheat relative Aegilops umbellulata

Chromosome-specific sequencing reveals an extensive dispensable genome component in wheat

Chromosome-scale genome assembly of bread wheat’s wild relative Triticum timopheevii

Background & Summary

Methods