Introduction

Macrocystis pyrifera (giant kelp) is the world’s largest macroalga and one of the fastest growing multicellular autotrophs on Earth, increasing in mass by an average of 3.5% per day in favorable locations [1]. It forms extensive subtidal forests on shallow reefs in temperate seas that are among the most productive ecosystems on Earth [2, 3]. Importantly, the high primary production and three-dimensional structure of giant kelp forests provide habitat for hundreds of species, ranging from microscopic invertebrates to different types of fish and mammals [4]. It’s not surprising that Darwin considered giant kelp forests analogous to terrestrial rainforests, owing to the impressive species diversity sustained in both habitats [5].

As a haplodiplontic organism, giant kelp alternates between a macroscopic diploid sporophyte stage and a microscopic haploid gametophyte stage. The diploid sporophyte releases haploid zoospores into the water column that eventually settle on the ocean floor and develop into sexed gametophytes. Female gametophytes release a pheromone that triggers the release and attraction of sperm from the male gametophyte towards their oogonia [6]. After fertilization, young diploid sporophytes develop into the macroscopic adults composing kelp forests [4]. The dispersal distance of the zoospores depends on many factors, including water motion [7,8,9]. Successful colonization depends on opposite sex zoospores settling within millimeters of each other on suitable habitat [10, 11], which is dependent on the size of nearby kelp forests as well as the number and synchronous timing of zoospores released by them [12,13,14,15].

Interest in giant kelp as an aquaculture crop has increased, with the global algal aquaculture market in 2020 producing 35 million tons of algae worth $16.5 billion [16]. Giant kelp is one of the main commercial sources of alginate, a long chain polysaccharide found in the cell walls of brown macroalgae, with uses in food as a thickener and in medicine as a hydrogel [17, 18]. Giant kelp also has a range of other applications including human food, animal feed, cosmetics, pharmaceuticals, and fertilizers [19]. Furthermore, due to its fast growth rate and limited composition of lignin and cellulose, giant kelp has been identified as a potential marine crop for biofuel [20]. However, the large-scale cultivation of giant kelp lags that of other kelps such as Saccharina japonica and Undaria pinnatifida, which are grown in China, Korea, and Japan as human food sources and have undergone selective breeding programs since the 1950s [21,22,23].

Genomics can greatly benefit aquaculture production by assisting in breeding efforts to increase crop productivity, increase the quality of specific compounds in the crop, and increase resistance to stress, disease, and bacterial infection [24]. The first brown macroalga to have its genome completely sequenced was Ectocarpus siliculosus, which has served as a model species [25]. Both Saccharina japonica and Undaria pinnatifida have since had their genomes sequenced, an important step in breeding programs because reference genomes provide an individual’s complete genetic information that can be universally compared across experiments [26,27,28].

Previous research has identified the need for improved cultivars of brown macroalgae, and this improvement can be expedited in giant kelp by increasing the availability of genomic tools, such as a quality reference genome and sequencing experiments [29, 30]. However, the available genomic references for giant kelp are thus far limited to a giant kelp transcriptome in 2013, a heavily fragmented genome with an estimated completeness of only 10% based on stramenopile single copy orthologs, and a set of gene models derived from reciprocal blasts against Ectocarpus siliculosus [31,32,33]. Prior research identified a northern hemisphere origin for giant kelp based on phylogenetic analysis of the ribosomal internal transcribed regions [34]. Further molecular dating in conjunction with fossil records estimates that giant kelp emerged as a species ~6 million years ago and initially was found in the colder waters in the Pacific Ocean off the Alaskan coast [35]. The global hotspot of microsatellite genetic diversity for giant kelp is presently in the Southern California Bight in the northeast Pacific, reflecting a Pleistocene glacial refugium for the species [36]. A draft of the annotated genome presented here has already been used to examine the genomic differences between two different morphologies of giant kelp in the Northern and Southern hemispheres [37]. Therefore, in our efforts to support giant kelp domestication, we improved giant kelp’s genomic resources by assembling a nuclear genome with 92% of sequences scaffolded to chromosomal levels, and investigated three Southern California populations for markers that can be used in future selective breeding models for giant kelp aquaculture.

Results

Genome sequencing, assembly, and annotation

We extracted DNA from a single female haploid gametophyte and sequenced the DNA using PacBio Sequel II technology, obtaining 57 GB of long reads with an average read length of 15.8 kb, representing a coverage of approximately 100× of the estimated 513–542 MB giant kelp genome. Our de novo assembly generated 1,033 contigs with a total length of 540 MB and N50 of 1.7 MB. After decontamination, scaffolding using Hi-C technology was performed by Phase Genomics and clustered 96.82% of the contigs into 35 clusters, resulting in a final assembly of 35 scaffolds and 188 contigs with a total genome size of 537 MB and N50 of 13.6 MB (Table 1). Scaffolds average 13.3 MB and 34 out of 35 are larger than 4 MB (Fig. 1).
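The coverage figure quoted above follows from dividing the total sequenced bases by the genome size. The short sketch below works through that arithmetic for both ends of the estimated genome size range; it assumes that "GB" and "MB" denote gigabases and megabases, and the function name is illustrative only.

```python
def mean_coverage(total_bases: float, genome_size: float) -> float:
    """Average sequencing depth: total sequenced bases divided by genome size."""
    return total_bases / genome_size


total_bases = 57e9  # 57 GB of long reads, read here as gigabases (assumption)

for genome_size in (513e6, 542e6):  # kmer-based genome size estimates from the text
    depth = mean_coverage(total_bases, genome_size)
    print(f"{genome_size / 1e6:.0f} MB genome -> {depth:.0f}x coverage")
```

Both values come out slightly above 100×, in line with the approximate coverage stated in the text.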

Table 1 Genome statistics comparison between the genomes of Macrocystis pyrifera (assembled in this study), Ectocarpus sp. [25], Saccharina japonica [26], and Undaria pinnatifida [27]
Fig. 1 Circos plot of first 34 contigs of the giant kelp genome. Different aspects of the genome are represented in each concentric circle. A Scaffold size in MB. B Gene density heatmap. C Percentage of GC ranging from 45 to 55%. D Nucleotide diversity ranging from 0 to 0.007. E SNP density heatmap. F Tajima’s D values ranging from -2 to 2. G Fst values ranging from 0 to 0.4. H TE density heatmap. Line values are plotted on the same 200 kb sliding window with 40 kb intervals while heatmaps are plotted over the same 1 MB windows. Heatmaps are in a 1.75 log scale for greater dynamic range at higher values

We used BUSCO to assess the gene content completeness of our genome, as it looks for single copy conserved orthologs in a given taxonomic group. The stramenopile BUSCO analysis showed that the giant kelp genome (Macrocystis pyrifera) compared favorably to other published brown macroalgae genomes, with 94 complete BUSCO genes, 1 fragmented gene, and 5 missing genes. We compared the assembled giant kelp genome presented here with other existing giant kelp resources, highlighting the improvements this new assembly offers. These include a recently published, highly fragmented genome that encompasses only 11 complete stramenopile BUSCO genes, as well as a limited set of gene models that underwent reciprocal blast against Ectocarpus siliculosus and encompassed 78 complete BUSCO genes (Fig. 2) [32, 33].

Fig. 2 Comparison of BUSCO assessment of genome completeness based on the stramenopiles_odb10 dataset between the macroalgal genomes of Macrocystis pyrifera (assembled in this study), Ectocarpus sp. [25], Saccharina japonica [26], Undaria pinnatifida [27], Macrocystis pyrifera gene models from Molano et al. 2022 [34], and genome from Paul et al. 2022 [33]

We report the genome assembly statistics of our giant kelp genome and the three major brown macroalgae genomes, Ectocarpus siliculosus, Saccharina japonica, and Undaria pinnatifida. Our N50 for giant kelp is 13,669,674 bp, which is greater than the N50 for Ectocarpus siliculosus and Saccharina japonica, but less than that of Undaria pinnatifida. Our giant kelp genome also contained fewer missing bases per 100,000 bases than the other brown macroalgae genomes. The giant kelp estimated genome size based on kmer frequency from the corrected PacBio reads is 513 MB for k = 25 and 542 MB for k = 31 (Supplementary Fig. 1). Our assembled genome is 537 MB, which falls within this predicted range of genome sizes. The 537 MB genome size is smaller than the flow cytometry and microspectrophotometry estimates of 686 MB to 1,176 MB and less than half of the upper bound; those physical estimates may have been inflated due to the polytenic state of the female gametophytes of giant kelp [38, 39].

Protein-coding genes were predicted using ab initio, homology-based, and transcriptome-based modelers, which after filtering resulted in the prediction of 25,919 protein coding genes, greater than predictions of Saccharina japonica and Ectocarpus siliculosus, with 18,733 and 16,256 respectively (Table 1 and Fig. 1B) [25, 26]. Brown macroalgae have a UV sex determining system in which sex determining regions (SDR) determine the sex of an individual gametophyte. SDRs typically are less gene dense, contain large repeats of DNA, and do not recombine, and thus can expand in size as they accumulate repetitive DNA [40]. Scaffold 2 contains fewer genes than other scaffolds of similar size (Scaffold 1 contains 1135 genes, Scaffold 2 600 genes and Scaffold 3 1194 genes). In the second half of scaffold 2, the gene density and the pairwise genetic diversity decrease in comparison with the rest of the genome (Fig. 1B). If Scaffold 2 contains the SDR for giant kelp, it should only contain female markers for kelp SDR regions. After blasting the kelp genome using previously identified male and female SDR markers, the second half of scaffold 2 contained the four sex determining markers specific for females in giant kelp [41]. This fits expectation as a female gametophyte was the individual sequenced in this experiment. The presence of the female sex determining markers in combination with decreases in gene density and genetic diversity indicates that scaffold 2 contains the putative SDR of giant kelp.

Genome comparative analysis

To identify orthologous genes among other relevant macroalgae, we performed a comparative analysis using Orthofinder between Macrocystis pyrifera, Ectocarpus siliculosus, Saccharina japonica and Undaria pinnatifida. A total of 70,317 genes were analyzed, of which 61,267 (87.1%) were assigned to a total of 14,001 orthogroups; of those, 7,660 were present in all four species (Fig. 3A and B).

Fig. 3 Analysis of orthologs. A Protein comparative analysis of orthologs between giant kelp and three other relevant macroalgal species (Ectocarpus sp., Saccharina japonica and Undaria pinnatifida) using OrthoFinder. Numbers represent shared orthogroups between species. B Species tree inferred by OrthoFinder

Differences in genome size between Ectocarpales and Laminariales have been explained by the expansion of repetitive elements in the larger genomes of Laminariales [42]. Following this pattern, we found repetitive sequences to total 57.6% (309 Mb) (Fig. 1H and Supplementary Table 1) of giant kelp’s genome, similar to what was found in the laminarian U. pinnatifida (52.1%) and contrasting with the lower amount of repetitive sequences in E. siliculosus (22.7%).

Overall synteny is conserved between E. siliculosus and M. pyrifera single copy orthologs, but there are signs of chromosomal rearrangement including the splitting of four chromosomes and fusion of two, which explains the chromosome number difference between the two species (Fig. 4A). As expected, synteny is more conserved between M. pyrifera and U. pinnatifida, both in the order Laminariales, with signals of chromosomal splitting with similar gene density but less chromosomal rearrangement (Fig. 4B).

Fig. 4 Synteny between the Macrocystis pyrifera genome (dark green) and the genomes of A Ectocarpus siliculosus and B Undaria pinnatifida (light green). Bands represent clusters of at least 10 single copy orthologs no more than 3 MB apart. Purple bands are potential chromosome splitting or fusion. Gray bands represent scaffolds that share the highest number of orthologs and, therefore, are most syntenic. Red bands represent orthologs in different scaffolds. When multiple bands overlap, bands with fewer orthologs are drawn on top of bands with more orthologs for easier visualization. Histogram represents density of single copy orthologs in 1 MB windows

We also calculated linkage disequilibrium (LD) for the 48 samples, separately for each population, as it can be used to improve selective breeding models. Using r^2 = 0.1 as the LD threshold, we estimated the LD block size to be ~5.5 kb for the Catalina Island population and ~6 kb for the Camp Pendleton and Santa Barbara populations (Fig. 5). Pairwise genetic diversity (π) across the whole giant kelp genome was estimated at π = 0.0035 (Fig. 1C). Tajima’s D was overall negative throughout the genome, with an average of -1.17522 across 200 kilobase windows (Fig. 1F). Variant effect annotation showed that 2.66% of the variants were found inside exons with 53.1% predicted to be missense, while intergenic variants accounted for 32.73% of all variants (Supplementary Table 2 and Supplementary Fig. 2).

Fig. 5 LD decay curve for each sampling population separately

Discussion

Our study presents an improved annotated and scaffolded giant kelp reference genome capable of supporting a genomics approach for the ongoing domestication and conservation efforts for this species. This giant kelp reference genome compares favorably to the three published major brown macroalgae genomes, Ectocarpus siliculosus, Saccharina japonica, and Undaria pinnatifida, with similar N50 values (genome contiguity) and BUSCO scores (genome completeness) [25,26,27, 43]. Additionally, our study presents a comprehensive analysis of whole genome linkage disequilibrium, nucleotide diversity and Tajima’s D. Compared to previous giant kelp genomes, the giant kelp assembly presented here vastly improves on genome contiguity and completeness, in particular when comparing BUSCO scores (94% compared to 11%) [32]. Therefore, we anticipate that the scaffolded giant kelp genome presented here will be the universal reference for future giant kelp genomic projects.

Conclusion

The giant kelp genome presented in this study will assist in ongoing giant kelp domestication and conservation efforts by providing a reference genome that can be used as a comparative benchmark between giant kelp individuals sequenced in other kelp genetic studies. The functional annotations of the genome can help pinpoint the genomic locations of genes of interest for domestication for further genetic variation analysis. The use of a conserved gene model data set for phylogenetic studies may be sufficient in giant kelp as the strong population structure seen in the diploid data concurred with the results from the gene models.

Materials and methods

Data collection and sequencing

Sporophylls (spore-producing blades containing sporangia in diploid sporophytes) were collected from giant kelp individuals attached to the rocky substrate near Catalina Island using SCUBA. Sporophylls were then sent to the University of Wisconsin-Milwaukee, where spores were released into Instant Ocean water. After 15–20 days, single gametophytes were isolated into single genotype cultures. To produce sufficient biomass required to extract 2 μg of high quality and high molecular weight genomic DNA, we cultivated a single haploid female gametophyte from the Catalina Island population (CI_03). To avoid gametogenesis, we grew the culture vegetatively in red light (30 μmol photons m−2 s−1), at 12 °C and with a 12:12 h (day:night) photoperiod.

To ensure the least amount of contamination during PacBio sequencing, the culture was repeatedly treated with antibiotics until no bacterial colonies would form when plated [44, 45]. High molecular weight DNA was extracted using the protocol of Doyle and Doyle (1987) with minor modifications. Essentially, young gametophytes that had been flash frozen and kept frozen at -80 °C were ground to a fine powder in a frozen mortar with liquid N2, followed by very gentle extraction in CTAB buffer (that included proteinase K, PVP-40 and beta-mercaptoethanol [46]) for 20 min at 37 °C and 20 min at 50 °C. After centrifugation, the supernatant was gently extracted twice with 24:1 chloroform:isoamyl alcohol. The upper phase was adjusted to 1/10th volume with 3 M sodium acetate (pH = 5.2), gently mixed, and DNA precipitated with isopropanol. DNA was collected by centrifugation, washed with 70% EtOH, air dried for a few minutes, and dissolved thoroughly in 1× TE at room temperature. Size was validated by pulsed field electrophoresis.

Sequencing of sheared DNA > 30 kb was performed at the Arizona Genome Institute on a PacBio Sequel II Platform. A SMRTbell Express Template Prep Kit 2.0 was used for library preparation and a Sequel II Binding Kit was used for the sequencing that generated 56 GB of long read data.

Genome assembly

Contamination from sample collection and library preparation has been found in many different reference genomes, and can be the cause for erroneous results in downstream analysis [47]. Contamination has been identified as a concern in genomes of other brown macroalgae, including Saccharina japonica [48]. In order to assemble our genome using sequences free of most contamination, we loosely aligned our PacBio reads to three different brown macroalgae genomes (Ectocarpus siliculosus, Saccharina japonica and Cladosiphon okamuranus) using the Minimap2 map-pb option [49] and excluded from our assembly sequences that did not map to any of the brown macroalgae genomes using Samtools v1.15.1 [50]. We assembled the remaining reads using Canu 1.9 [51] on standard settings, resulting in a preliminary assembly of 1,039 contigs containing 539 Mbp, which was then polished using Racon v1.5.0 [52].

Because using published genomes to do an initial contamination filter relies on those genomes being contamination free, we analyzed the potential contamination of each contig separately. We split the assembly into individual contigs using faSplit [53] and used diamond blast v2.10 to blast each contig against the Uniprot reference proteome database with an e-value of 1e-15 [54, 55]. We then added the results to the blobtools pipeline using the add --hits command and filtered contigs based on length (contig > 10,000 base pairs), GC content (between 0.35–0.65), coverage (5–300×), and blast classification, keeping contigs identified as Phaeophyceae and no-hit. The first round of filtering removed 161 contigs and ~42 MB of sequence. The remaining 870 contigs were then sent to Phase Genomics for scaffolding.
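The four filtering criteria above can be expressed as a single predicate per contig. The following is a minimal sketch, not the blobtools implementation: the `Contig` record, its field names, and the example values are hypothetical, and treating the GC and coverage bounds as inclusive is an assumption.

```python
from dataclasses import dataclass

# Thresholds taken from the text; everything else in this sketch is hypothetical.
MIN_LENGTH = 10_000            # contig must be longer than 10,000 bp
GC_RANGE = (0.35, 0.65)        # bounds assumed inclusive
COVERAGE_RANGE = (5.0, 300.0)  # bounds assumed inclusive
KEEP_TAXA = {"Phaeophyceae", "no-hit"}


@dataclass
class Contig:
    name: str
    length: int      # base pairs
    gc: float        # GC fraction between 0 and 1
    coverage: float  # mean read depth
    taxon: str       # best blast classification, or "no-hit"


def passes_filter(contig: Contig) -> bool:
    """Return True only if the contig satisfies all four criteria."""
    return (
        contig.length > MIN_LENGTH
        and GC_RANGE[0] <= contig.gc <= GC_RANGE[1]
        and COVERAGE_RANGE[0] <= contig.coverage <= COVERAGE_RANGE[1]
        and contig.taxon in KEEP_TAXA
    )


# Made-up example records, one failing each criterion in turn.
contigs = [
    Contig("tig1", 250_000, 0.50, 95.0, "Phaeophyceae"),
    Contig("tig2", 8_000, 0.49, 90.0, "Phaeophyceae"),     # too short
    Contig("tig3", 40_000, 0.68, 80.0, "no-hit"),          # GC out of range
    Contig("tig4", 60_000, 0.52, 450.0, "no-hit"),         # coverage too high
    Contig("tig5", 30_000, 0.51, 100.0, "Proteobacteria"), # unwanted taxon
    Contig("tig6", 15_000, 0.47, 12.0, "no-hit"),
]
kept = [c.name for c in contigs if passes_filter(c)]
print(kept)
```

Because every criterion must hold, a single misclassified blast hit is enough to discard an otherwise well-behaved contig, which is the failure mode described later in this section.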

Chromatin conformation capture data was generated using a Phase Genomics (Seattle, WA) Proximo Hi-C 2.0 Kit, which is a commercially available version of the Hi-C protocol [56]. Following the manufacturer’s instructions for the kit, intact cells were crosslinked using a formaldehyde solution, digested using the DpnII restriction enzyme, end repaired with biotinylated nucleotides, and proximity ligated to create chimeric molecules composed of fragments from different regions of the genome that were physically proximal in vivo, but not necessarily genomically proximal. Continuing with the manufacturer’s protocol, molecules were pulled down with streptavidin beads and processed into an Illumina-compatible sequencing library. Sequencing was performed on an Illumina NovaSeq.

Reads were aligned to the 870-contig draft giant kelp assembly, also following the manufacturer’s recommendations (https://phasegenomics.github.io/2019/09/19/hic-alignment-and-qc.html). Briefly, reads were aligned using BWA-MEM with the -5SP and -t 8 options specified, and all other options default [57]. SAMBLASTER was used to flag PCR duplicates, which were later excluded from analysis [58]. Alignments were then filtered with samtools using the -F 2304 filtering flag to remove secondary and supplementary alignments [50]. Putative misjoined contigs were broken using Juicebox based on the Hi-C alignments [59, 60].
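The value 2304 used with the -F option is the sum of two SAM flag bits, 256 (secondary alignment) and 2048 (supplementary alignment), so any record with either bit set is dropped. The sketch below illustrates that bitwise test; the helper name and the example flag values are chosen only for illustration.

```python
# SAM flag bits as defined in the SAM format specification.
SECONDARY = 0x100      # 256: secondary alignment
SUPPLEMENTARY = 0x800  # 2048: supplementary alignment
EXCLUDE_MASK = SECONDARY | SUPPLEMENTARY  # 2304


def is_excluded(flag: int, mask: int = EXCLUDE_MASK) -> bool:
    """Mimic an exclusion mask: drop the record if any masked bit is set."""
    return (flag & mask) != 0


# 99 = paired, proper pair, mate reverse, first in pair (a primary alignment).
for flag in (99, 99 + SECONDARY, 99 + SUPPLEMENTARY):
    print(flag, "excluded" if is_excluded(flag) else "kept")
```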

Phase Genomics’ Proximo Hi-C genome scaffolding platform was used to create chromosome-scale scaffolds from the corrected assembly as described in Bickhart et al. [61]. As in the LACHESIS method, this process computes a contact frequency matrix from the aligned Hi-C read pairs, normalized by the number of restriction sites on each contig, and constructs scaffolds in such a way as to optimize expected contact frequency and other statistical patterns in Hi-C data [62]. Approximately 20,000 separate Proximo runs were performed to optimize the number of scaffolds and scaffold construction in order to make the scaffolds as concordant with the observed Hi-C data as possible. Finally, Juicebox was again used to correct scaffolding errors.

After scaffolding and preliminary annotation, we found that several contigs that had been discarded from the assembly contained stramenopiles_odb10 BUSCO genes. We determined that our initial filtering with the blobtools pipeline had been too strict, mostly due to misclassification of the contigs during the blast analysis. The misclassification of sequences has been shown to be an increasing problem in sequence databases [63, 64]. After manually checking the 161 discarded contigs for potential candidate giant kelp contigs, we added 29 more contigs into the giant kelp assembly, increasing the size of the genome by ~38 MB to 537 MB (Supplementary Fig. 3). The additional contigs also raised the BUSCO score using stramenopiles_odb10. Unfortunately, the initial 870 contigs had already been scaffolded, and completely re-scaffolding the genome was cost prohibitive.

Genome completeness

To assess genome completeness using single copy orthologs, we used BUSCO v5.2.1 in genome mode in conjunction with the stramenopiles_odb10 dataset to compare our giant kelp genome to the publicly available genomes of Ectocarpus siliculosus, Saccharina japonica, and Undaria pinnatifida [64]. BUSCO (Benchmarking Universal Single Copy Orthologs) searches a genome for single copy ortholog proteins that are in data sets of specific lineages. BUSCO scores from the giant kelp genome can then be compared against other brown macroalgal genomes, with a higher number of complete BUSCO genes indicating a more complete genome and a lower number of duplicated genes indicating fewer duplication artifacts from assembly.

Other methods to check genome completeness include genome contiguity, usually measured using the N50 statistic, and comparing genome size to estimated genome size [65]. N50 is the length of the shortest contig or scaffold for which contigs or scaffolds with greater or equal length cover at least 50% of the assembly [66]. Here we compared the same genomes using QUAST v5.0.2 with standard settings in order to generate genome assembly statistics [43].
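The N50 definition above translates directly into a few lines of code. This is a minimal sketch of the statistic itself, not of QUAST's implementation; the function name and the toy contig lengths are illustrative.

```python
def n50(lengths):
    """Length of the shortest sequence in the smallest set of longest
    sequences that together cover at least 50% of the assembly."""
    if not lengths:
        raise ValueError("need at least one contig or scaffold length")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence downwards until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Toy example: total length 54, so the target is 27; 10 + 9 + 8 reaches it.
print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))
```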

When assessed using microspectrophotometry, the potential genome sizes of giant kelp gametophytes ranged from 882 MB to 1,176 MB [67]. However, when using flow cytometry, the giant kelp genome was estimated to be 686 MB [68]. This discrepancy may be explained by heterogeneous amounts of nuclear material in giant kelp gametophytes. Giant kelp female gametophytes have been shown to sometimes have double the genetic material compared to most male gametophytes [39]. Other brown macroalgae, such as Saccharina latissima, also have variable amounts of DNA content in their haploid tissue [38]. Therefore, using physical parameters to accurately calculate the genome size of brown macroalgae may require homogeneous mixes of cells with the same levels of DNA content. Computational methods can estimate the genome size of an organism based on approximating the repeat structure of sequenced shotgun reads from a genome [69]. Sometimes, the genome estimates using kmers, or unique subsequences of DNA of length k, may produce different lengths of the genome compared with physical measurements from flow cytometry [70]. Since the physical estimates of the giant kelp genome are not consistent, we compared them against a kmer based estimate. Long reads from PacBio sequencing have been shown to accurately estimate genome size using kmers as long as the reads have been corrected [71, 72]. To computationally estimate genome size, we used the corrected PacBio reads generated by Canu with specific settings for Sequel II reads: correctedErrorRate = 0.035 utgOvlErrorRate = 0.065 trimReadsCoverage = 2 trimReadsOverlap = 500 [51]. We then counted the number of kmers found in the corrected reads using the kmer counter program kmc and two different kmer sizes, with k = 25 and k = 31 [71]. Kmer distributions for both k = 25 and k = 31 were plotted, and genome size was estimated by summing the total number of kmers and dividing by the mean coverage of kmers in the genome.
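The kmer-based estimate described above divides the total number of kmer occurrences by the kmer coverage. The sketch below illustrates that calculation on a made-up histogram. Two choices here are assumptions rather than details from the text: low-count kmers are dropped as likely sequencing errors (the `min_count` cutoff is hypothetical), and when no coverage value is supplied the histogram peak is used as a stand-in for the mean kmer coverage.

```python
def estimate_genome_size(histogram, coverage=None, min_count=5):
    """Estimate genome size from a kmer count histogram.

    histogram maps a kmer count (depth) to the number of distinct kmers
    observed that many times. Depths below min_count are ignored as
    probable errors (assumed cutoff). If coverage is None, the depth with
    the most distinct kmers (the histogram peak) is used instead.
    """
    kept = {depth: n for depth, n in histogram.items() if depth >= min_count}
    if not kept:
        raise ValueError("no kmers left after applying min_count")
    total_kmers = sum(depth * n for depth, n in kept.items())
    if coverage is None:
        coverage = max(kept, key=kept.get)  # peak depth
    return total_kmers / coverage


# Made-up histogram: an error peak at depth 1-2, a main peak near depth 50,
# and a small repeat peak at depth 100.
toy_histogram = {1: 1000, 2: 50, 48: 10, 49: 40, 50: 100, 51: 40, 52: 10, 100: 5}
print(estimate_genome_size(toy_histogram))
```

In this toy case the repeat peak at depth 100 contributes twice as much sequence per distinct kmer as the main peak, which is how repeated regions are accounted for in the total.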

Annotation

The giant kelp nuclear genome assembly was annotated using the JGI Annotation pipeline [72, 73]. The following steps describe the pipeline in brief. The genome assembly was masked for repeats using RepeatMasker [74] with the RepBase library [75] and the most frequent repeats (more than 150 copies) identified by RepeatScout [76]. Protein-coding gene models were predicted using the following gene modelers: ab initio modelers Fgenesh [77] and GeneMark [78], homology-based Fgenesh+ and GeneWise [79] seeded by BLASTx alignments against the NCBI NR database, and transcriptome-based modelers Fgenesh, combest [80], and Braker [81]. For use in gene prediction, transcriptome assemblies were generated from Illumina RNAseq reads (Accession numbers: SRR5026366, SRR5026588, SRR5026590, SRR5026591, SRR5026593, SRR5026594, SRR3544557, SRR3615022) using Trinity (v2.11.0) [82], and as input to Braker, RNA reads were mapped to the genome using HISAT2 [83]. To select the best representative gene model at each locus, automated filtering was performed based on homology and transcriptome support. In addition, genes with similarity to transposable elements (TE), containing known TE-related Pfam domains, or lying within repeat-masked regions were excluded from the annotated gene set. Finally, the protein sequences of the predicted gene models were functionally annotated using SignalP v3 for signal sequences [84], TMHMM for transmembrane domains [85], InterproScan for protein domains [86], and homologs based on Blastp alignments against the NCBI NR, SwissProt, and KEGG [87] databases.

Annotation of transposable elements was done using RepeatModeler v2.0.3 with LTRStruct option and genomeSampleSizeMax of 81 Mb [88]. The output was classified using RepeatMasker v4.1.2 on standard settings [74]. Genome assembly and annotations are available from the JGI algal genome portal PhycoCosm (https://phycocosm.jgi.doe.gov/Macpyr2) [89].

Comparative analysis

The protein datasets from Macrocystis pyrifera, Ectocarpus siliculosus, Saccharina japonica, and Undaria pinnatifida were used for ortholog analysis with OrthoFinder v2.5.4, and visualization of the species tree was done with Dendroscope [90, 91]. The positions of single copy orthologs between M. pyrifera and E. siliculosus were used to determine synteny between the two genomes. Circos v2.30.1 [91] was used to graph links between orthologs in each genome. Clusters where three or more orthologs have no more than 1 MB of distance between them were graphed as bands linking their respective position in each genome.
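One way to read the clustering rule above is as a chaining procedure: orthologs on the same pair of scaffolds are sorted by position, consecutive orthologs are joined while they stay within 1 MB of each other in both genomes, and chains with at least three members become bands. The sketch below implements that reading; it is an assumption about the procedure, not the authors' script, and the function name, tuple layout, and coordinates are hypothetical.

```python
from collections import defaultdict


def synteny_bands(pairs, max_gap=1_000_000, min_orthologs=3):
    """Group ortholog pairs into bands.

    pairs: iterable of (scaffold_a, pos_a, scaffold_b, pos_b).
    Returns a list of (scaffold_a, start_a, end_a, scaffold_b, start_b, end_b).
    """
    by_scaffolds = defaultdict(list)
    for scaf_a, pos_a, scaf_b, pos_b in pairs:
        by_scaffolds[(scaf_a, scaf_b)].append((pos_a, pos_b))

    def to_band(scaf_a, scaf_b, cluster):
        a_positions = [p[0] for p in cluster]
        b_positions = [p[1] for p in cluster]
        return (scaf_a, min(a_positions), max(a_positions),
                scaf_b, min(b_positions), max(b_positions))

    bands = []
    for (scaf_a, scaf_b), points in sorted(by_scaffolds.items()):
        points.sort()  # order by position in genome A
        cluster = [points[0]]
        for point in points[1:]:
            previous = cluster[-1]
            close_in_a = point[0] - previous[0] <= max_gap
            close_in_b = abs(point[1] - previous[1]) <= max_gap
            if close_in_a and close_in_b:
                cluster.append(point)
                continue
            # Gap too large: close the current chain and start a new one.
            if len(cluster) >= min_orthologs:
                bands.append(to_band(scaf_a, scaf_b, cluster))
            cluster = [point]
        if len(cluster) >= min_orthologs:
            bands.append(to_band(scaf_a, scaf_b, cluster))
    return bands


# Made-up coordinates: three nearby orthologs form one band; the rest are isolated.
example_pairs = [
    ("s1", 100_000, "c1", 200_000),
    ("s1", 600_000, "c1", 900_000),
    ("s1", 1_200_000, "c1", 1_500_000),
    ("s1", 5_000_000, "c1", 9_000_000),
    ("s1", 300_000, "c2", 50_000),
]
print(synteny_bands(example_pairs))
```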

SNP calling and population genetics

Raw Illumina reads from 49 giant kelp diploid individuals from three Southern California populations were downloaded from NCBI (https://www.ncbi.nlm.nih.gov/bioproject/661280). Reads were trimmed of adapter sequences and low-quality tails using Trimgalore [92]. The reads were then aligned to our giant kelp reference genome using HISAT2 v2.1.0 with standard parameters, and the ensuing alignment file was converted to binary and sorted using Samtools v1.9 [50, 83]. Mean depth per individual across the genome, calculated using VCFtools v0.1.16, was ~8×, and one individual was removed from the data set due to poor coverage [93]. After removing PCR duplicates, we called variants such as single nucleotide polymorphisms (SNPs) and insertion/deletions (indels), producing a variant call file (VCF) using the GATK4 best practices pipeline [94]. Initial filtering followed the hard filtering suggestions from GATK: “QD < 2.0 || MQ < 40.0 || FS > 60.0 || HaplotypeScore > 13.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0”. We then filtered the VCF further for population genetics analysis on the following parameters: insertions and deletions removed, biallelic SNPs only, pass quality thresholds of 30, site is called in 90% or more individuals, and each site has a mean depth of 3 reads. Initially, there were 25,374,044 SNPs and indels in the raw VCF file before filtering. The filters reduced the number of SNPs to 16,019,851 for downstream analysis.
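The hard filtering expression quoted above is a logical OR of six threshold tests, so a variant is flagged as soon as any one of them is true. The sketch below evaluates the same thresholds on a dictionary of INFO annotations. It is an illustration only, not GATK code: the function name and example records are hypothetical, and skipping a test when its annotation is absent is an assumption of this sketch.

```python
import operator

# (annotation, comparison, threshold), copied from the expression in the text.
HARD_FILTERS = [
    ("QD", operator.lt, 2.0),
    ("MQ", operator.lt, 40.0),
    ("FS", operator.gt, 60.0),
    ("HaplotypeScore", operator.gt, 13.0),
    ("MQRankSum", operator.lt, -12.5),
    ("ReadPosRankSum", operator.lt, -8.0),
]


def failed_filters(info):
    """Return the names of the annotations whose threshold test is true.

    An empty list means the variant passes. Annotations missing from
    `info` are skipped (assumption of this sketch).
    """
    return [
        name
        for name, compare, threshold in HARD_FILTERS
        if name in info and compare(info[name], threshold)
    ]


good_variant = {"QD": 25.0, "MQ": 60.0, "FS": 1.2, "MQRankSum": 0.1}
bad_variant = {"QD": 1.5, "MQ": 35.0, "FS": 70.0}
print(failed_filters(good_variant))
print(failed_filters(bad_variant))
```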

We performed a principal component analysis (PCA) on the genetic variation using the hard filtered VCF as an input into SNPRelate v1.22.0. We used the SNPRelate standard pipeline and plotted the PCA using ggplot2 v3.3.2 [95, 96]. We then calculated pairwise genetic diversity and Tajima’s D using VCFtools and the hard filtered VCF file across genomic windows of 200 kb with a step interval of 40 kb, and for each population separately [93]. We also calculated FST between the three populations using VCFtools and the hard filtered VCF file across genomic windows of 200 kb with a step interval of 40 kb.
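To make the windowed statistics above concrete, the sketch below generates 200 kb windows with a 40 kb step and computes Tajima's D for the segregating sites of one window using the standard formula (the difference between pairwise diversity and Watterson's estimate, scaled by its expected standard deviation). It is a simplified illustration rather than the VCFtools implementation; the function names, the handling of the last partial windows, and the example allele counts are assumptions.

```python
import math


def sliding_windows(seq_len, size=200_000, step=40_000):
    """Yield 0-based half-open (start, end) windows; trailing windows are truncated."""
    for start in range(0, seq_len, step):
        yield start, min(start + size, seq_len)


def tajimas_d(allele_counts, n):
    """Tajima's D for one window.

    allele_counts: for each site, copies of one allele among n sampled chromosomes.
    Returns None when D is undefined (no segregating sites or zero variance).
    """
    counts = [c for c in allele_counts if 0 < c < n]  # keep segregating sites only
    s = len(counts)
    if s == 0:
        return None
    # Sum over sites of the mean number of pairwise differences.
    pi = sum(2 * c * (n - c) / (n * (n - 1)) for c in counts)
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    variance = e1 * s + e2 * s * (s - 1)
    if variance <= 0:
        return None
    return (pi - s / a1) / math.sqrt(variance)


print(list(sliding_windows(300_000))[:3])
print(tajimas_d([1] * 10, n=10))  # only rare alleles: negative D
print(tajimas_d([5] * 10, n=10))  # only intermediate-frequency alleles: positive D
```

An excess of rare alleles lowers pairwise diversity relative to the number of segregating sites, which is the pattern behind the negative genome-wide Tajima's D reported in the Results.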

To analyze linkage disequilibrium and population structure, we further filtered the VCF file using VCFtools with the options --maf 0.10 and --max-missing 1. This keeps alleles that are present in at least 3 individuals and includes only sites that have no missing data. Phasing of the file was done using py-popgen [97], with the beagle package implementation [98]. We then calculated and plotted linkage disequilibrium using PopLDdecay. Prediction of the number of populations was done using fastStructure with k values from 1–10 and the built-in chooseK.py script. Different k values were plotted with distruct.py.
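For the LD decay analysis described above, the block size reported in the Results is the distance at which mean r^2 falls below the 0.1 threshold. The sketch below shows one way to read that distance off a binned decay curve by linear interpolation between the two bins that bracket the threshold. The function name, the interpolation choice, and the curve values are assumptions made for illustration and are not taken from the study's data.

```python
def ld_decay_distance(distances, r2_values, threshold=0.1):
    """Distance at which a decay curve first drops below `threshold`.

    distances and r2_values are paired bin midpoints and mean r^2 values.
    Returns None if the curve never crosses the threshold from above.
    """
    points = sorted(zip(distances, r2_values))
    for (d0, r0), (d1, r1) in zip(points, points[1:]):
        if r0 >= threshold > r1:
            # Linear interpolation between the two bracketing bins.
            fraction = (r0 - threshold) / (r0 - r1)
            return d0 + fraction * (d1 - d0)
    return None


# Made-up decay curve: r^2 crosses 0.1 halfway between 4 kb and 8 kb.
toy_distances = [1_000, 2_000, 4_000, 8_000]
toy_r2 = [0.40, 0.20, 0.12, 0.08]
print(ld_decay_distance(toy_distances, toy_r2))
```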