Whole-genome sequencing of 13 Arctic plants and draft genomes of Oxyria digyna and Cochlearia groenlandica

Kim, Jun; Lim, Jiseon; Kim, Moonkyo; Lee, Yoo Kyung

doi:10.1038/s41597-024-03569-6


Data Descriptor
Open access
Published: 18 July 2024

Volume 11, article number 793 (2024)

Scientific Data


Abstract

To understand the genomic characteristics of Arctic plants, we generated 28–44 Gb of short-read sequencing data from 13 Arctic plants collected from the High Arctic Svalbard. We successfully estimated the genome sizes of eight species by using the k-mer-based method (180–894 Mb). Among these plants, the mountain sorrel (Oxyria digyna) and Greenland scurvy grass (Cochlearia groenlandica) had relatively small genome sizes and chromosome numbers. For these two species, we obtained 45× and 121× high-fidelity long-read sequencing data, respectively. We assembled the reads into high-quality draft genomes (genome size: 561 and 250 Mb; contig N50 length: 36.9 and 14.8 Mb, respectively), and correspondingly annotated 43,105 and 29,675 genes using ~46 and ~85 million RNA sequencing reads. We identified 765,012 and 88,959 single-nucleotide variants, and 18,082 and 7,698 structural variants (variant size ≥ 50 bp). This study provides high-quality genome assemblies of O. digyna and C. groenlandica, which are valuable resources for the population and molecular genetic studies of these plants.


Background & Summary

Arctic plants live in vulnerable environments. They are exposed to short growing seasons, temperature fluctuations, strong winds, and oligotrophic soils¹. Occasionally, their habitats are disturbed by an overflow of glacial meltwater². The formation of thaw ponds by permafrost thawing and changes in temperature and precipitation result in vegetation shifts and/or community trait changes^3,4. Arctic plants are driven into competition with subarctic plants and are influenced by boreal animals, such as moose, beaver, red fox, and boreal birds that expand into the Arctic tundra⁵. Before plant populations of the Arctic tundra become endangered, the legacy of their adaptation to harsh Arctic environments merits investigation. However, little is known about the genomes of Arctic plants. Therefore, we attempted to obtain whole-genome sequencing data for 13 Arctic plants commonly found in Svalbard, using Illumina sequencing technology.

Oxyria digyna is a fast-growing forb that is widely distributed in the Arctic and alpine regions^6,7,8. This plant is an important food source for herbivores, such as insects, birds, and mammals, and is even eaten by indigenous people in the Arctic^6,9,10,11. Phylogeographic analysis of the plastid and nuclear genes showed that O. digyna originated in the Qinghai-Tibet Plateau and spread to Russia, eastward to North America, and westward to Western Europe⁸.

Despite its wide geographic distribution, only two major ecotypes of O. digyna, namely northern and southern, have been well-characterized. Both are long-day plants, and the northern type has fewer flowering branches and more rhizomes than the southern type^12,13. As no reference genome sequence is available for this species, the genetic architecture underlying phenotypic variations and other population genetic structures has not yet been fully resolved.

Cochlearia is a genus comprising approximately 30 species, including annual and perennial herbs of the family Brassicaceae. The leaves are smoothly rounded or kidney-shaped and have long stalks, so that they resemble spoons¹⁴. The scientific name Cochlearia derives from the Greek “kokhliárion,” meaning spoon, and an English name for Cochlearia is likewise spoonwort. Some Cochlearia plants are rich in vitamin C, and another common name, “scurvy-grass,” reflects their use as a traditional remedy for scurvy, a disease caused by vitamin C deficiency. Salt and heavy metal tolerance has been reported in some Cochlearia plants¹⁵.

Cochlearia groenlandica is a biennial or occasionally short-lived perennial herb with a wide distribution in Greenland, Svalbard, Iceland, Alaska, Canada, and Russia¹⁶. It grows in various habitats, including gravelly and sandy plains, sediment plains, moss tundra, patterned grounds, seashores, and bird-cliff meadows. It can grow relatively well regardless of the soil pH or nutrient level and can survive and reproduce under harsh conditions as a stunted form¹⁷. The size of individual plants and leaves varies; they can be ten times larger in nutrient-rich areas, such as below bird cliffs. Polar bears graze on C. groenlandica at the foot of a large seabird colony on a cliff in Spitsbergen, Svalbard¹⁸. Chromosome counting confirmed the haploid chromosome number of C. groenlandica as x = 7^19,20. There are two populations of C. groenlandica in Iceland with genetic and morphological differences. The alpine population is genetically and morphologically similar to those in Greenland and Svalbard, whereas the coastal population is different from the Arctic population²⁰. Like O. digyna, C. groenlandica has no reference genome and only limited genomic resources, which limits the genetic understanding of this Arctic plant.

In this study, we provide short-read sequencing data of 13 Arctic plants as well as high-quality draft genome sequences of O. digyna and C. groenlandica generated using the high-fidelity (HiFi) long-read sequencing technology of Pacific Biosciences (PacBio) (Fig. 1). We assessed the assembly quality and annotated the genetic variants, repetitive sequences, and genes in these genome assemblies (Fig. 2). These are the first draft genomes of the genera Oxyria and Cochlearia based on long-read sequencing data. These data will be a valuable resource for understanding the genomic characteristics of Arctic plants and the population structures of O. digyna and C. groenlandica.

Methods

Plant sampling

Arctic plant samples were collected from Spitsbergen Island in the Svalbard Archipelago between July 25 and August 3, 2021. Thirteen dominant species in Svalbard were selected for genome size estimation, including Eriophorum scheuchzeri ssp. arcticum, Papaver dahlianum, Silene acaulis, Silene uralensis ssp. arctica, Bistorta vivipara, Oxyria digyna, Saxifraga oppositifolia, Cochlearia groenlandica, Salix polaris, Dryas octopetala, Betula nana ssp. nana, Cassiope tetragona, and Polemonium boreale (Fig. 1). Only the amount of leaf tissue required for analysis (less than 2.0 g) was collected from each plant to minimize the impact on the local population. The collected samples were immediately placed in yellow envelopes and dried in the field using silica gel. The plant leaves were lyophilized in envelopes at the Dasan Arctic Research Station and stored in 15-mL conical tubes at 4 °C until they were transferred to Korea.

DNA extraction and Illumina sequencing

DNA was extracted using the GeneAll® Exgene™ Plant SV Mini Kit (GeneAll Biotechnology), following the manufacturer's protocol with slight modifications. The concentration and quality of the extracted DNA were checked using a NanoPhotometer® NP80 (Implen GmbH), and the quality was further checked by electrophoresis.

Illumina sequencing was performed by DNALink (https://dnalink.com/). The DNA library for sequencing was produced using an Illumina DNA Sample Prep Kit, and the completed DNA library was sequenced with paired-end reads of 151 bp using the Illumina NovaSeq 6000 platform. DNA sequencing data (28–44 Gb) were obtained (Table 1)²¹, which were suitable for estimating genome sizes of approximately 1 Gb.

Table 1 Species information and sequencing data summary of 13 Svalbard plants. ND, Not determined; NA, Not applicable; Chromosome numbers and ploidy data were obtained from Svalbard Flora (https://svalbardflora.no/).


Genome size estimation

KAT (version 2.4.1) was used to estimate the genome sizes of the 13 Svalbard plants²². The command kat hist -m 27 was applied to all prepared Illumina sequencing data. The genome sizes of the eight species for which estimation succeeded ranged from 180 Mb to 894 Mb (Table 1). It is noteworthy that if their genomes are highly repetitive, the true genome size could be larger than the estimated size, although the order of magnitude would be similar. We were unable to determine the genome sizes of Bistorta vivipara, Papaver dahlianum, Polemonium boreale, Silene acaulis, and Saxifraga oppositifolia. We suspect that the genome sizes of these species may exceed 1 Gb. Because the genome sizes of O. digyna and C. groenlandica were estimated to be relatively small, ~352 and ~180 Mb, respectively, and their haploid (n) chromosome numbers were only seven, we focused on constructing the draft genomes of these plant species.
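The idea behind a k-mer-based genome size estimate can be illustrated with a minimal sketch. It is not KAT's implementation: the function names, the forward-strand-only counting, and the `min_depth` error cutoff are assumptions made for illustration. The estimate divides the total number of counted k-mers by the depth of the main peak in the k-mer histogram.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer in the reads (forward strand only, for simplicity)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def kmer_histogram(counts):
    """Map each multiplicity to the number of distinct k-mers seen that often."""
    return Counter(counts.values())


def estimate_genome_size(histogram, min_depth=2):
    """Estimate genome size as (total k-mers) / (depth of the main peak).

    Multiplicities below min_depth are treated as sequencing errors and
    ignored (an assumption of this sketch, not a KAT parameter).
    """
    usable = {d: n for d, n in histogram.items() if d >= min_depth}
    if not usable:
        raise ValueError("no k-mers at or above min_depth")
    peak_depth = max(usable, key=lambda d: usable[d])
    total_kmers = sum(d * n for d, n in usable.items())
    return total_kmers / peak_depth


# Toy histogram: 1000 error k-mers seen once, 500 single-copy k-mers at
# depth 30, and 20 two-copy k-mers at depth 60.
toy = Counter({1: 1000, 30: 500, 60: 20})
print(estimate_genome_size(toy))
```

In the toy histogram the repeated k-mers at depth 60 contribute twice to the estimate, which is why a highly repetitive genome whose repeats are poorly represented in the histogram can be underestimated.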

Identification of telomeric repeats

We also attempted to annotate the telomeric repeats of each species using the short-read sequencing data, because telomeric repeat motifs can vary within plants and other species-rich phyla^{23,24,25,26,27,28,29,30,31,32,33}. As telomeric repeats should be found as sequential and continuous repetitive sequences in a single read, we subsampled 20 million reads from the Illumina reads of the 13 Svalbard plants and only used 60-bp regions of each read by trimming off the initial 10 bp and the last 81 bp. We then counted k-mers of sequentially repetitive units that were similar to the canonical plant telomeric repeat, TTTAGGG (Arabidopsis-type)²⁶.

Possible telomeric repeats of every species were successfully discovered, and 11 out of 13 species contained canonical plant telomeric repeats, TTTAGGG (Arabidopsis-type) (Table 1 and Fig. 3a). The remaining two species, P. dahlianum and S. oppositifolia, had distinct noncanonical telomeric repeats. The telomeric repeat of S. oppositifolia, TTTTAGGG, is the Chlamydomonas-type²⁵. However, the telomeric repeat of P. dahlianum, TTCAGGG, was novel.
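The read-trimming and repeat-counting steps described in this section can be sketched as follows. This is an illustrative reconstruction, not the script used in the study: the function names, the requirement that the whole 60-bp window be an exact tandem array, and the candidate unit lengths of 6 to 8 bp are all assumptions. Rotations and reverse complements of a unit are collapsed into one representative so that TTTAGGG, AGGGTTT, and CCCTAAA are counted together.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def trim_read(read, skip=10, keep=60):
    """Drop the first 10 bp and keep the next 60 bp (a 151-bp read loses its last 81 bp)."""
    return read[skip:skip + keep]


def canonical_unit(unit):
    """Return one representative for all rotations of a unit and of its reverse complement."""
    rc = unit.translate(_COMPLEMENT)[::-1]
    rotations = [s[i:] + s[:i] for s in (unit, rc) for i in range(len(s))]
    return min(rotations)


def tandem_unit(window, unit_len):
    """Return the repeat unit if the whole window is a tandem array of it, else None."""
    unit = window[:unit_len]
    if len(window) < 2 * unit_len or set(unit) - set("ACGT"):
        return None
    repeated = (unit * (len(window) // unit_len + 1))[:len(window)]
    return unit if repeated == window else None


def count_tandem_units(reads, unit_lengths=(6, 7, 8)):
    """Count reads whose trimmed window is a perfect tandem repeat, keyed by canonical unit."""
    counts = Counter()
    for read in reads:
        window = trim_read(read)
        for n in unit_lengths:
            unit = tandem_unit(window, n)
            if unit is not None:
                counts[canonical_unit(unit)] += 1
                break
    return counts


# Toy read: 10 bp of N, a TTTAGGG array, then filler bases.
toy_read = "N" * 10 + "TTTAGGG" * 9 + "A" * 81
print(count_tandem_units([toy_read]))
```

A production version would likely tolerate mismatches within the window, since real reads contain sequencing errors and degenerate telomeric units.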

DNA and RNA sequencing analyses of Oxyria digyna and Cochlearia groenlandica for de novo genome assembly

All the experiments described in this section were performed by NICEM (https://nicem.snu.ac.kr). DNA was extracted using the cetyltrimethylammonium bromide (CTAB) method³⁴. The library was prepared using the SMRTbell template prep kit 2.0 and sequenced on a PacBio Sequel IIe sequencer using the Sequel II sequencing kit 2.0. The computed data were subjected to the Circular Consensus Sequencing (CCS) protocol of the SMRT Link (version 11.0.0.146107) to derive the HiFi data (CCS reads of Q20 or higher). Based on the estimated genome sizes, we generated 25.5 Gb (1.5 M reads, 17 kb on average) and 30.2 Gb (1.6 M reads, 18 kb on average) of HiFi long-read sequencing data for O. digyna and C. groenlandica, respectively (Table 1 and Fig. 3b)^35,36. HiFi reads were utilized to re-estimate their genome sizes using a k-mer-based method, KMC (version 3.2.1; kmc -k21 -ci1 -cs1000000 and kmc_tools transform histogram -cx1000000) and GenomeScope2 (version 2.0; genomescope.R --max_kmercov 1000000 -k 21), with parameters optimized for highly repetitive genomes^{37,38,39,40,41}. We estimated that C. groenlandica has a genome of ~275 Mb with 0.15% heterozygosity and that O. digyna has a genome of ~561 Mb with 0.43% heterozygosity (Fig. 3c,d). These estimated genome sizes differ from the initial ones estimated by KAT, but are much closer to the genome assembly sizes (see the next section).
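As a quick arithmetic check, sequencing depth is the total number of sequenced bases divided by the genome size. The sketch below uses the HiFi yields stated above and, as an assumption, the primary assembly sizes reported later in the text (561 and 250 Mb) as denominators; with those values it reproduces the 45× and 121× depths quoted in the abstract. The function names are hypothetical.

```python
def coverage(total_bases_gb, genome_size_mb):
    """Sequencing depth = total sequenced bases / genome size."""
    return total_bases_gb * 1e9 / (genome_size_mb * 1e6)


def yield_gb(reads_millions, mean_read_length_bp):
    """Approximate yield in Gb from read count (millions) and mean read length (bp)."""
    return reads_millions * 1e6 * mean_read_length_bp / 1e9


# HiFi yields from the text; genome sizes assumed to be the primary assembly sizes.
print(round(coverage(25.5, 561), 1))  # O. digyna
print(round(coverage(30.2, 250), 1))  # C. groenlandica
print(yield_gb(1.5, 17_000))          # rounded read count x rounded mean length
```

Because the read counts and mean lengths in the text are rounded, the yield recomputed from them only approximates the reported totals.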

Total RNA was extracted using the GeneAll Hybrid-R™ kit and RNA quality was measured using an Agilent Bioanalyzer RNA Nanochip. The library was prepared using the NEXTflex Rapid Directional RNA-Seq Bundle Kit (Perkin Elmer). qPCR was performed to quantify libraries capable of forming clusters and RNA sequencing was performed on the Illumina NovaSeq 6000 platform. We obtained 14 and 25.7 Gb (92.6 M and 170.3 M 151-bp paired-end reads) of RNA sequencing data for O. digyna and C. groenlandica, respectively (Table 1)^35,36.

De novo genome assembly and quality assessment using repetitive sequences and evolutionary conserved genes

For de novo genome assembly using HiFi sequencing data, data processing was conducted as described previously⁴². Specifically, we assembled 45× and 121× HiFi raw reads of O. digyna and C. groenlandica, respectively, into contigs using hifiasm (version 0.16.1; default option for O. digyna and hifiasm -s 0.51 for C. groenlandica) and converted two GFA-formatted output files to FASTA-formatted files^43,44. We masked previously known angiosperm repetitive sequences using RepeatMasker (version 4.1.2.p1; RepeatMasker -species angiosperms -s), identified O. digyna- and C. groenlandica-specific repetitive sequences using RepeatModeler (version 2.0.3; BuildDatabase and RepeatModeler -database -LTRStruct -ninja_dir), and then masked these species-specific repeats using RepeatMasker (version 4.1.2.p1; RepeatMasker -lib -s)⁴⁵.

We further annotated repeats that were not classified by RepeatModeler using a transposon classification tool, RFSB (version 1.0; transposon_classifier_RFSB -mode classify -fastaFile -outputPredictionFile)⁴⁶. The Arabidopsis-type telomeric repeat, TTTAGGG, was searched for at all ends of the contigs using RepeatMasker outputs, and the telomeric repeat clusters were manually validated²⁶. To determine whether repeat density affected assembly quality, we calculated read depths and repeat compositions of 100-kb binned intervals of contigs (contigs ≥ 1 Mb for C. groenlandica and ≥ 10 Mb for O. digyna were used). We divided these intervals into less repetitive and more repetitive groups using the median repeat composition of the intervals (median repetitive ratio: 0.57 for C. groenlandica and 0.69 for O. digyna). Their raw read depths were visualized as violin plots. BUSCO values were calculated using BUSCO and its Eudicot database (version 5.2.2; busco -m genome --auto-lineage-euk)^47,48. Genome assembly quality values (QVs) and completeness were calculated based on k-mers using Merqury (version 1.3; meryl k=21 count output and merqury.sh with default option)⁴⁹.
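The median split of 100-kb intervals described above can be sketched as follows. The input layout (a list of repeat-ratio and read-depth pairs, one per interval), the function name, and the choice to place intervals exactly at the median in the less repetitive group are assumptions for illustration; the text does not specify how ties were handled.

```python
from statistics import median


def split_by_repeat_ratio(bins):
    """Split intervals into less and more repetitive groups at the median repeat ratio.

    bins: list of (repeat_ratio, read_depth) tuples, one per 100-kb interval.
    Returns (cutoff, depths_of_less_repetitive, depths_of_more_repetitive).
    """
    cutoff = median(ratio for ratio, _ in bins)
    less = [depth for ratio, depth in bins if ratio <= cutoff]
    more = [depth for ratio, depth in bins if ratio > cutoff]
    return cutoff, less, more


# Toy intervals: the most repetitive one has an inflated depth,
# as would be expected for a collapsed repeat.
toy_bins = [(0.2, 100), (0.4, 104), (0.6, 110), (0.8, 300)]
cutoff, less, more = split_by_repeat_ratio(toy_bins)
print(cutoff, median(less), median(more))
```

Comparing the two depth lists (for example with violin plots, as in the study) then shows whether highly repetitive intervals carry excess read depth.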

De novo genome assemblies using HiFi reads represented two partially phased haplotypes (primary and alternative) for both O. digyna and C. groenlandica (Fig. 4a and Table 2)^50,51. The two genome assemblies had genome sizes of 561 Mb and 546 Mb for O. digyna and 250 Mb and 170 Mb for C. groenlandica. The longest contig lengths were 79.5 and 55.8 Mb for O. digyna and 36.4 and 28.2 Mb for C. groenlandica, and N50 lengths were 36.9 and 29.4 Mb for O. digyna and 14.8 and 9.0 Mb for C. groenlandica, respectively (Fig. 4a). These genome assembly sizes were concordant with the estimated genome sizes under the assumption of highly repetitive genomes (Fig. 3c,d).
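For readers unfamiliar with the N50 statistic quoted above, the following minimal sketch shows the standard definition: the length of the contig at which the cumulative length of contigs, sorted from longest to shortest, first reaches half of the total assembly length. The function name and toy values are illustrative only.

```python
def n50(lengths):
    """Return the contig length at which the running total reaches half the assembly size."""
    if not lengths:
        raise ValueError("no contig lengths given")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


# Toy assembly of five contigs (arbitrary units).
print(n50([10, 8, 6, 4, 2]))
```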

Table 2 Summary statistics of draft genome assemblies.


Indeed, 69.0% (388 Mb) and 68.5% (374 Mb) of the O. digyna assemblies and 61.9% (155 Mb) and 69.8% (118 Mb) of the C. groenlandica assemblies were masked as repetitive sequences (Fig. 4b and Table 3). Approximately 96% (371 and 361 Mb) of O. digyna and 90% (142 and 104 Mb) of C. groenlandica repetitive sequences were classified as interspersed repeats, including DNA transposons and retroelements. As repetitive regions are difficult to assemble well, we analyzed whether more repetitive regions exhibit more skewed read depth distributions by comparing the read depths of the less and more repetitive regions of the genome assemblies (Fig. 4c). In C. groenlandica, median read depths were similar in the more and less repetitive regions (108.37 and 102.58, respectively), indicating that most repetitive regions were well resolved (Fig. 4c). However, some regions with high repeat density exhibited substantially higher read depths (up to six times the median), suggesting that some highly repetitive regions were collapsed in our genome assembly (Fig. 4c). In O. digyna, the two groups of regions showed much more similar read depth distributions and much lower variances than those of C. groenlandica (median: 45.04 for low and 44.00 for high; standard deviation: 5.02 for low and 5.34 for high). This indicates that repetitive regions in the O. digyna assembly were resolved as well as non-repetitive regions. Moreover, the complete BUSCO values (single-copy plus duplicated) were 93.3% and 93.2% for the two O. digyna genome assemblies (eudicots_odb10) and 90.4% for the primary genome assembly of C. groenlandica (brassicales_odb10) (Fig. 4d), suggesting their high completeness. However, the alternative genome assembly of C. groenlandica exhibited a complete BUSCO value of only 51.4%. This implies that many parts of the genome are missing from it, concordant with its smaller size relative to the primary genome assembly (170 Mb vs. 250 Mb).

Table 3 Summary statistics for repetitive sequences (unit: Mb).


These genome assemblies were also assessed using k-mer-based QVs and completeness. O. digyna genome assemblies exhibited over QV60 and 92% completeness (Fig. 4d). In contrast, the primary genome assembly of C. groenlandica exhibited over QV60 and 87% completeness, but its alternative genome assembly exhibited only QV56 and 51% completeness (Fig. 4d). When the primary and alternative genome assemblies were merged for each species, the QVs were close to QV60 and completeness was approximately 98% (Fig. 4d). It is noteworthy that the QVs were calculated using raw HiFi reads and HiFi-based genome assemblies, so the values could be overestimated, as they were not independently generated.
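To make the QV scale concrete, the sketch below follows the k-mer-based estimate described in the Merqury publication: the fraction of assembly k-mers that are absent from the reads is converted to a per-base error rate and then to a Phred-scaled value. The function name and the toy counts are assumptions; the sketch does not read Merqury output files.

```python
import math


def kmer_qv(asm_only_kmers, total_asm_kmers, k=21):
    """Phred-scaled consensus quality from assembly-only k-mer counts.

    asm_only_kmers: k-mers found in the assembly but not in the read set.
    total_asm_kmers: all k-mers in the assembly.
    """
    if total_asm_kmers <= 0 or not 0 <= asm_only_kmers <= total_asm_kmers:
        raise ValueError("counts must satisfy 0 <= asm_only <= total and total > 0")
    if asm_only_kmers == 0:
        return math.inf
    # Probability that a single base is correct, given the k-mer survival rate.
    base_correct = (1 - asm_only_kmers / total_asm_kmers) ** (1 / k)
    error_rate = 1 - base_correct
    return -10 * math.log10(error_rate)


# Toy counts: 21 erroneous 21-mers per million corresponds to roughly
# one base error per million bases.
print(round(kmer_qv(21, 1_000_000), 2))
```

On this scale, QV60 corresponds to about one error per million bases and QV56 to about 2.5 errors per million bases.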

Overall, these metrics imply that our genome assemblies are highly accurate at the base level and mostly represent the full genome, except for the alternative genome assembly of C. groenlandica. We therefore recommend using only the primary genome assembly for C. groenlandica. These genome assemblies could be further scaffolded at the chromosome level in the near future with the advancement of other long-read and long-range sequencing technologies.

Gene annotation and variant calling

We mapped RNA-seq raw reads to each soft-masked primary genome assembly of O. digyna and C. groenlandica using HISAT2 (version 2.2.1; hisat2-build and hisat2, default options) and sorted the output using SAMtools (version 1.11; samtools sort and samtools index)^52,53. The output BAM file and soft-masked genome were used to annotate genes using BRAKER2 (version 2.1.6; braker.pl --genome --bam --softmasking)^{52,54,55,56,57}. To identify known homologous genes, protein-coding genes were searched in the UniProt database (UniProtKB/Swiss-Prot Release 2021_03 of 02-Jun-2021; UniProtKB/TrEMBL Release 2021_03 of 02-Jun-2021) using MMseqs2 (version 9cc89aa594131293b8bc2e7a121e2ed412f0b931; mmseqs easy-search -s 7)^58,59.

Single nucleotide polymorphisms (SNPs) were identified by mapping HiFi raw reads to the reference genome using minimap2 (version 2.22-r1101; minimap2 -a -x map-hifi); the mapping files were sorted using SAMtools (version 1.13; samtools sort and samtools index); and SNP calling was performed using DeepVariant (version 1.2.0; run_deepvariant --model_type PACBIO --ref --reads --output_vcf)^52,60,61,62. SNPs of genes were analyzed using SnpEff (version 5.1d; java -jar snpEff.jar build -gtf22 -v DBname and java -jar snpEff.jar DBname VCFfile)⁶³. To identify structural variants (SVs), we first aligned our alternative assembly to the reference using Winnowmap2 (version 2.03; meryl count k=19, meryl print greater-than distinct=0.9998, and winnowmap -W -ax asm20 --cs -r2k) and sorted the output using SAMtools (version 1.7; samtools sort and samtools index)^52,64. The output BAM file was analyzed using SVIM-asm to call the SVs (version 1.0.2; svim-asm haploid)⁶⁵. Genes affected by SVs were analyzed using BEDTools (version v2.30.0; bedtools intersect)^66,67.

We identified 43,105 and 29,675 possible protein-coding genes in O. digyna and C. groenlandica, respectively, of which 33,134 and 27,381 had matches in the UniProt database (Table 4). By mapping raw HiFi reads onto our reference genome, we identified 765,012 and 88,959 heterozygous SNPs, comprising 0.14% and 0.04% of the total genomic length, respectively (Table 4). The effects of SNPs on genes were categorized as follows: 1,001 high-impact, 22,592 moderate-impact, and 15,594 low-impact SNPs for O. digyna and 253 high-impact, 6,536 moderate-impact, and 6,373 low-impact SNPs for C. groenlandica (Table 4). We found 9,574 deletion SVs, 8,302 insertion SVs, 26 inversion SVs, 97 interspersed duplication SVs, and 83 tandem duplication SVs in O. digyna and 4,079 deletion SVs, 3,577 insertion SVs, 5 inversion SVs, 10 interspersed duplication SVs, and 27 tandem duplication SVs in C. groenlandica (Table 4 and Fig. 4e). Coding sequences in 1,104 and 328 genes of O. digyna and C. groenlandica, respectively, were affected by SVs. Of these, 747 and 147 genes had known homologs (Table 4).
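The variant counts above can be cross-checked against the totals given in the abstract with a short calculation. The per-type SV counts and SNP counts are taken from the text; the use of the primary assembly sizes (561 and 250 Mb) as the "total genomic length" is an assumption of this sketch, and the dictionary and function names are illustrative.

```python
SV_COUNTS = {
    "O. digyna": {
        "deletion": 9574, "insertion": 8302, "inversion": 26,
        "interspersed duplication": 97, "tandem duplication": 83,
    },
    "C. groenlandica": {
        "deletion": 4079, "insertion": 3577, "inversion": 5,
        "interspersed duplication": 10, "tandem duplication": 27,
    },
}
SNP_COUNTS = {"O. digyna": 765_012, "C. groenlandica": 88_959}
# Assumed denominators: primary assembly sizes in Mb.
ASSEMBLY_MB = {"O. digyna": 561, "C. groenlandica": 250}


def sv_total(species):
    """Sum the SV counts over all SV types for one species."""
    return sum(SV_COUNTS[species].values())


def snp_percent(species):
    """SNP sites as a percentage of the assumed genome length."""
    return 100 * SNP_COUNTS[species] / (ASSEMBLY_MB[species] * 1e6)


for name in SV_COUNTS:
    print(name, sv_total(name), round(snp_percent(name), 2))
```

The per-type SV counts sum to the 18,082 and 7,698 structural variants reported in the abstract, and the SNP densities round to the stated 0.14% and 0.04%.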

Table 4 Summary of numbers of genes and genetic variants.


Circos visualization

The numbers of genes, genetic variants, and repetitive sequences were summed for every 100-kb bin by utilizing the output files of BRAKER2, DeepVariant, SVIM-asm, and RepeatMasker. HiFi read depths for every position were obtained by applying SAMtools to the intermediate BAM files of the DeepVariant analysis (version 1.13; samtools depth -a) and were summed for every 100-kb bin⁵². We visualized the gene, genetic variant, repeat, and read depth densities using Circos (version 0.69-8; circos -conf configurationFile.txt; plot type = histogram for read depths and plot type = heatmap for gene, genetic variant, and repeat densities)⁶⁸ (Fig. 2).
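The 100-kb binning step can be sketched as below. This is an illustration under stated assumptions rather than the procedure actually used: features are represented as (contig, 1-based position) pairs, depth rows mirror the three columns printed by samtools depth (contig, 1-based position, depth), and the function names are hypothetical.

```python
from collections import Counter

BIN_SIZE = 100_000


def bin_counts(positions, bin_size=BIN_SIZE):
    """Count features per bin; positions is an iterable of (contig, 1-based position)."""
    counts = Counter()
    for contig, pos in positions:
        counts[(contig, (pos - 1) // bin_size)] += 1
    return counts


def bin_depth_sums(depth_rows, bin_size=BIN_SIZE):
    """Sum per-base depths per bin; depth_rows is an iterable of (contig, 1-based position, depth)."""
    sums = Counter()
    for contig, pos, depth in depth_rows:
        sums[(contig, (pos - 1) // bin_size)] += depth
    return sums


# Toy SNP positions on two hypothetical contigs.
snps = [("ctg1", 1), ("ctg1", 100_000), ("ctg1", 100_001), ("ctg2", 5)]
print(bin_counts(snps))
```

Each key is a (contig, bin index) pair, where bin 0 covers positions 1 to 100,000; the resulting tables can be written out in the start/end/value layout that Circos data tracks expect.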

Data Records

All sequencing reads generated in this study have been deposited in the NCBI Sequence Read Archive under the accession numbers SRP404573, SRP427161, and SRP468445^21,35,36. The assemblies have been deposited in GenBank under the accession numbers GCA_029168935.1 and GCA_040259375.1^50,51. Genome sequences, gene annotation, and amino acid and coding sequences of annotated genes are available in the figshare database⁶⁹. Homology search and variant impact analysis for genes and repeat classification data are also available at figshare⁶⁹.

Technical Validation

Both the Illumina and the PacBio HiFi DNA sequencing datasets exceeded 25 Gb per species. HiFi reads showed suitable read length distributions (mean read length: 17,289 and 18,429 bp for Oxyria digyna and Cochlearia groenlandica, respectively). Assembly quality was assessed using contig length and BUSCO. The N50 lengths of the primary assemblies were greater than 10 Mb, and the longest contig lengths were 79 and 37 Mb, respectively. BUSCO values exceeded 90%.

Code availability

All scripts, parameters, and options used in the study are described in the Methods. We used only publicly available programs; no custom programs were developed.

References

Lee, Y. K. Arctic Plants of Svalbard. 1 edn, (Springer Nature, 2020).
Kim, Y. J. et al. Chronological changes in soil biogeochemical properties of the glacier foreland of Midtre Lovénbreen, Svalbard, attributed to soil-forming factors. Geoderma 415, 115777, https://doi.org/10.1016/j.geoderma.2022.115777 (2022).
van der Kolk, H.-J., Heijmans, M. M., van Huissteden, J., Pullens, J. W. & Berendse, F. Potential Arctic tundra vegetation shifts in response to changing temperature, precipitation and permafrost thaw. Biogeosciences 13, 6229–6245 (2016).
Bjorkman, A. D. et al. Plant functional trait change across a warming tundra biome. Nature 562, 57–62 (2018).
Speed, J. D. et al. Will borealization of Arctic tundra herbivore communities be driven by climate warming or vegetation change? Global Change Biology 27, 6568–6577 (2021).
Tolvanen, A., Alatalo, J. M. & Henry, G. H. Resource allocation patterns in a forb and a sedge in two arctic environments—short‐term response to herbivory. Nordic Journal of Botany 22, 741–747 (2002).
Allen, G. A., Marr, K. L., McCormick, L. J. & Hebda, R. J. The impact of Pleistocene climate change on an ancient arctic–alpine plant: multiple lineages of disparate history in Oxyria digyna. Ecology and Evolution 2, 649–665 (2012).
Wang, Q. et al. Arctic plant origins and early formation of circumarctic distributions: a case study of the mountain sorrel, Oxyria digyna. New Phytologist 209, 343–353 (2016).
Geraci, J. R. & Smith, T. G. Vitamin C in the diet of Inuit hunters from Holman, Northwest Territories. Arctic, 135–139 (1979).
Porsild, A. E. & Cody, W. J. Vascular Plants of Continental Northwest Territories, Canada. (National Museums of Canada, 1980).
Ootoova, I., Pitseoiak, J., Joamie, A., Joamie, A. & Papatsie, M. Perspectives on Traditional Health (Interviewing Inuit Elders). (Nunavut Arctic College, 2001).
Mooney, H. A. & Billings, W. Comparative physiological ecology of arctic and alpine populations of Oxyria digyna. Ecological Monographs 31, 1–29 (1961).
Heide, O. M. Ecotypic variation among European arctic and alpine populations of Oxyria digyna. Arctic, Antarctic, and Alpine Research 37, 233–238 (2005).
Lee, Y. K. & Elvebakk, A. Handbook of Svalbard Plants. 138 (GEOBook, 2019).
Nawaz, I., Iqbal, M., Bliek, M. & Schat, H. Salt and heavy metal tolerance and expression levels of candidate tolerance genes among four extremophile Cochlearia species with contrasting habitat preferences. Science of The Total Environment 584-585, 731–741, https://doi.org/10.1016/j.scitotenv.2017.01.111 (2017).
Elven, R., Murray, D. F., Razzhivin, V. Y. & Yurtsev, B. A. Annotated Checklist of the Panarctic Flora (PAF) Vascular plants http://panarcticflora.org/ (2011).
Elven, R., Arnesen, G., Alsos, I. G. & Sandbakk, B. E. Svalbard Flora, https://svalbardflora.no/ (2020).
Stempniewicz, L. Polar bears observed climbing steep slopes to graze on scurvy grass in Svalbard. Polar Research 36, 1326453 (2017).
Nordal, I. Cytology and Reproduction in arctic Cochlearia. Sommerfeltia 11, 147–158 (1990).
Bruholt, E. A diploid in the Arctic–genetic and morphological variation of Cochlearia groenlandica L, (2019).
NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRP427161 (2023).
Mapleson, D., Garcia Accinelli, G., Kettleborough, G., Wright, J. & Clavijo, B. J. KAT: a K-mer analysis toolkit to quality control NGS datasets and genome assemblies. Bioinformatics 33, 574–576 (2017).
Červenák, F., Sepšiová, R., Nosek, J. & Tomáška, Ľ. Step-by-step evolution of telomeres: lessons from yeasts. Genome biology and evolution 13, evaa268 (2021).
Lim, J., Kim, W., Kim, J. & Lee, J. Telomeric repeat evolution in the phylum Nematoda revealed by high-quality genome assemblies and subtelomere structures. Genome Research 33, 1947–1957 (2023).
Fulnečková, J. et al. A broad phylogenetic survey unveils the diversity and evolution of telomeres in eukaryotes. Genome biology and evolution 5, 468–483 (2013).
Richards, E. J. & Ausubel, F. M. Isolation of a higher eukaryotic telomere from Arabidopsis thaliana. Cell 53, 127–136 (1988).
Weiss, H. & Scherthan, H. Aloe spp.–plants with vertebrate-like telomeric sequences. Chromosome Research 10, 155–164 (2002).
Sýkorová, E. et al. Telomere variability in the monocotyledonous plant order Asparagales. Proceedings of the Royal Society of London. Series B: Biological Sciences 270, 1893–1904 (2003).
Sýkorová, E. et al. Minisatellite telomeres occur in the family Alliaceae but are lost in Allium. American journal of botany 93, 814–823 (2006).
Petracek, M. E., Lefebvre, P. A., Silflow, C. D. & Berman, J. Chlamydomonas telomere sequences are A+ T-rich but contain three consecutive GC base pairs. Proceedings of the National Academy of Sciences 87, 8222–8226 (1990).
Tran, T. D. et al. Centromere and telomere sequence alterations reflect the rapid genome evolution within the carnivorous plant genus Genlisea. The Plant Journal 84, 1087–1099 (2015).
Peška, V. et al. Characterisation of an unusual telomere motif (TTTTTTAGGG) n in the plant Cestrum elegans (Solanaceae), a species with a large genome. The Plant Journal 82, 644–654 (2015).
Mravinac, B., Meštrović, N., Čavrak, V. V. & Plohl, M. TCAGG, an alternative telomeric sequence in insects. Chromosoma 120, 367–376 (2011).
Doyle, J. in Molecular techniques in taxonomy 283–293 (Springer, 1991).
NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRP404573 (2023).
NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRP468445 (2023).
Deorowicz, S., Kokot, M., Grabowski, S. & Debudaj-Grabysz, A. KMC 2: fast and resource-frugal k-mer counting. Bioinformatics 31, 1569–1576 (2015).
Kokot, M., Długosz, M. & Deorowicz, S. KMC 3: counting and manipulating k-mer statistics. Bioinformatics 33, 2759–2761 (2017).
Deorowicz, S., Debudaj-Grabysz, A. & Grabowski, S. Disk-based k-mer counting on a PC. BMC bioinformatics 14, 1–12 (2013).
Vurture, G. W. et al. GenomeScope: fast reference-free genome profiling from short reads. Bioinformatics 33, 2202–2204 (2017).
Ranallo-Benavidez, T. R., Jaron, K. S. & Schatz, M. C. GenomeScope 2.0 and Smudgeplot for reference-free profiling of polyploid genomes. Nature communications 11, 1432 (2020).
Kim, J. & Kim, C. A beginner’s guide to assembling a draft genome and analyzing structural variants with long-read sequencing technologies. STAR protocols 3, 101506 (2022).
Cheng, H., Concepcion, G. T., Feng, X., Zhang, H. & Li, H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nature methods 18, 170–175 (2021).
Cheng, H. et al. Haplotype-resolved assembly of diploid genomes without parental data. Nature Biotechnology 40, 1332–1335 (2022).
Smit, A., Hubley, R. & Green, P. (2015).
Riehl, K., Riccio, C., Miska, E. A. & Hemberg, M. TransposonUltimate: software for transposon classification, annotation and detection. Nucleic Acids Research 50, e64–e64 (2022).
Simão, F. A., Waterhouse, R. M., Ioannidis, P., Kriventseva, E. V. & Zdobnov, E. M. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics 31, 3210–3212 (2015).
Kriventseva, E. V. et al. OrthoDB v10: sampling the diversity of animal, plant, fungal, protist, bacterial and viral genomes for evolutionary and functional annotations of orthologs. Nucleic acids research 47, D807–D811 (2019).
Rhie, A., Walenz, B. P., Koren, S. & Phillippy, A. M. Merqury: reference-free quality, completeness, and phasing assessment for genome assemblies. Genome Biology 21, 1–27 (2020).
NCBI GenBank https://identifiers.org/ncbi/insdc.gca:GCA_029168935.1 (2023).
NCBI GenBank https://identifiers.org/ncbi/insdc.gca:GCA_040259375.1 (2024).
Li, H. et al. The sequence alignment/map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).
Kim, D., Paggi, J. M., Park, C., Bennett, C. & Salzberg, S. L. Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nature Biotechnology 37, 907–915 (2019).
Barnett, D. W., Garrison, E. K., Quinlan, A. R., Strömberg, M. P. & Marth, G. T. BamTools: a C++ API and toolkit for analyzing and managing BAM files. Bioinformatics 27, 1691–1692 (2011).
Hoff, K. J., Lange, S., Lomsadze, A., Borodovsky, M. & Stanke, M. BRAKER1: unsupervised RNA-Seq-based genome annotation with GeneMark-ET and AUGUSTUS. Bioinformatics 32, 767–769 (2016).
Hoff, K. J., Lomsadze, A., Borodovsky, M. & Stanke, M. Whole-genome annotation with BRAKER. Gene Prediction: Methods and Protocols, 65–95 (2019).
Brůna, T., Hoff, K. J., Lomsadze, A., Stanke, M. & Borodovsky, M. BRAKER2: automatic eukaryotic genome annotation with GeneMark-EP+ and AUGUSTUS supported by a protein database. NAR Genomics and Bioinformatics 3, lqaa108 (2021).
The UniProt Consortium. UniProt: the universal protein knowledgebase in 2021. Nucleic Acids Research 49, D480–D489 (2021).
Steinegger, M. & Söding, J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature Biotechnology 35, 1026–1028 (2017).
Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).
Li, H. New strategies to improve minimap2 alignment accuracy. Bioinformatics 37, 4572–4574 (2021).
Poplin, R. et al. A universal SNP and small-indel variant caller using deep neural networks. Nature Biotechnology 36, 983–987 (2018).
Cingolani, P. et al. A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly 6, 80–92 (2012).
Jain, C., Rhie, A., Hansen, N. F., Koren, S. & Phillippy, A. M. Long-read mapping to repetitive reference sequences using Winnowmap2. Nature Methods 19, 705–710 (2022).
Heller, D. & Vingron, M. SVIM-asm: structural variant detection from haploid and diploid genome assemblies. Bioinformatics 36, 5519–5521 (2020).
Quinlan, A. R. & Hall, I. M. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841–842 (2010).
Quinlan, A. R. BEDTools: the Swiss‐army tool for genome feature analysis. Current Protocols in Bioinformatics 47, 11.12.1–11.12.34 (2014).
Krzywinski, M. et al. Circos: an information aesthetic for comparative genomics. Genome Research 19, 1639–1645 (2009).
Kim, J., Lim, J., Kim, M. & Lee, Y. K. Whole-genome sequencing of 13 Arctic plants and draft genomes of Oxyria digyna and Cochlearia groenlandica. figshare https://doi.org/10.6084/m9.figshare.c.6965802.v1 (2023).

Acknowledgements

We thank Professor Sangkyu Park and Mr. Youngil Ryu for their help in collecting the leaf samples. This study was supported by the Korea Polar Research Institute (PE22450, PE23530, and PE24060) and funded by the Ministry of Oceans and Fisheries (to Y.K.L.). This work was supported by a National Research Foundation of Korea grant funded by the Korean government (MEST) [RS-2023-00247499] (to J.K.).

Author information

These authors contributed equally: Jun Kim, Jiseon Lim.

Authors and Affiliations

Department of Convergent Bioscience and Informatics, College of Bioscience and Biotechnology, Chungnam National University, Daejeon, 34134, Korea
Jun Kim & Jiseon Lim
Korea Polar Research Institute, Incheon, 21990, Korea
Moonkyo Kim & Yoo Kyung Lee
Department of Life Sciences, Incheon National University, Incheon, 22012, Korea
Moonkyo Kim
Department of Polar Sciences, University of Science and Technology, Incheon, 21990, Korea
Yoo Kyung Lee

Authors

Jun Kim
Jiseon Lim
Moonkyo Kim
Yoo Kyung Lee

Contributions

J.K.: Conceptualisation, Methodology, Formal Analysis, Investigation, Writing-Original Draft, Writing-Review & Editing. J.L.: Methodology, Formal Analysis, Investigation, Writing-Review & Editing. M.K.: Methodology, Formal Analysis, Investigation, Writing-Original Draft, Writing-Review & Editing. Y.K.L.: Conceptualisation, Methodology, Formal Analysis, Investigation, Writing-Original Draft, Writing-Review & Editing, Funding Acquisition, Supervision.

Corresponding author

Correspondence to Yoo Kyung Lee.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Kim, J., Lim, J., Kim, M. et al. Whole-genome sequencing of 13 Arctic plants and draft genomes of Oxyria digyna and Cochlearia groenlandica. Sci Data 11, 793 (2024). https://doi.org/10.1038/s41597-024-03569-6

Received: 06 December 2023
Accepted: 24 June 2024
Published: 18 July 2024
DOI: https://doi.org/10.1038/s41597-024-03569-6
Springer Nature Limited

Associated content

Genomics data for plant ecology, conservation and agriculture

Collection 20 January 2023
