Introduction

In comparison to the nuclear genome, chloroplast genomes have shown distinctive characteristics that enhance their utility. These features include a significantly reduced size and complexity, as well as a high copy number1. Chloroplast genomes are less susceptible to recombination events, an attribute that enhances their ability to preserve and perpetuate valuable evolutionary imprints2,3. This genomic stability, as well as the specific structural features, make plastomes invaluable tools for understanding the evolutionary history and the phylogenetic relationships of plant species.

The increased accessibility of chloroplast genome sequences has led to a deeper understanding of the evolutionary intricacies underlying these genomes. It has also facilitated the development of sophisticated analytical tools aimed at species identification through comparative and evolutionary studies. Illustrative examples of analyses related to plastid genomes include studies of nucleotide diversity and assessments of nucleotide and amino acid substitution rates measured by the dN/dS ratio4,5,6, as well as the identification of genomic rearrangements associated with inverted repeats (IR), which have proven valuable for phylogenomic investigations across different taxonomic levels, often serving to substantiate or improve the resolution of phylogenetic relationships7,8.

The family Simaroubaceae, order Sapindales9, comprises seven subfamilies, 22 genera, and approximately 189 species with formally recognized names 10. These taxa are predominantly distributed in tropical regions, with the Neotropics as the center of Simaroubaceae species diversity11,12. However, some species extend their range into subtropical or temperate climates 9,13. Members of the family are distinguished by the bitter taste of their bark and branches, which is attributed to quassinoid compounds. These compounds are synthesized in secretory cells distributed throughout the vegetative structures 11,14,15. Species of Simaroubaceae have attracted considerable interest because of their many traditional medicinal applications, which include the treatment of malaria, cancer, helminthiasis, viral infections, gastritis, ulcers, diarrhea, and diabetes. The insecticidal and fungicidal properties of these taxa have also been exploited 16,17.

The genus Simarouba Aubl. is an established monophyletic clade within the family Simaroubaceae18,19,20. The genus comprises six neotropical species, including three continental species: Simarouba amara Aubl., Simarouba versicolor A. St.-Hil., and Simarouba glauca DC.21. S. amara is widely distributed across Central and South America, with geographic boundaries extending into Bolivia and Brazil, where the southeastern region represents its distribution limit 21,22,23. S. versicolor is largely restricted to South America, particularly Brazil and Bolivia, while S. glauca occurs in Central America and Florida (USA) 21,24.

Notably, distinguishing among the three continental species is a recurrent challenge, as recognized by Cronquist (1994)21 and Franceschinelli et al. (1999)22. According to these authors, the species have very similar morphological characteristics and are parapatric, as evidenced in regions such as Costa Rica, where S. amara and S. glauca occur, and Brazil, where S. amara and S. versicolor may overlap in their geographic distributions. Phylogenetic assessments of the genus have analyzed morphological data from flowers and leaves, and one hypothesis postulates that S. versicolor may have evolved from S. glauca or vice versa, or that the two may share a more recent common ancestor with each other than with S. amara21,25.

The three species of Simarouba, S. amara, S. versicolor, and S. glauca, are dioecious and are pollinated primarily by small insects, such as small nocturnal moths26,27,28. The basic chromosome number of the family Simaroubaceae is postulated to be x = 12, with dysploidy events possibly giving rise to the fundamental numbers x = 14 and x = 15 in most lineages29. However, the basic chromosome number for the family remains to be confirmed. Cytogenetic records for the genus Simarouba are scarce, with data available for a single species, S. glauca, which has n = 15 30. In addition to the limited cytogenetic studies, genomic resources for the family Simaroubaceae are scarce. A search of the NCBI database, conducted on February 01, 2024, revealed complete chloroplast genomes for only five species within the family, none of which belongs to the genus Simarouba. This situation highlights the need to generate genomic data for Simaroubaceae in order to elucidate its molecular and evolutionary characteristics and to better understand the biology of this taxonomic group.

In this study, a dataset of 17 species belonging to the families Simaroubaceae, Rutaceae, and Meliaceae was used to perform comparative analyses of their plastomes. Two new complete plastid sequences were generated for representatives of the family Simaroubaceae, S. amara and S. versicolor, and the plastid genome of S. glauca was also assembled and annotated. The main objectives of this research were to characterize plastomes within the family, compare plastid genome structures among Sapindales species, investigate genomic diversity, and identify selection signals within plastid coding sequences. The study also contributes to a better understanding of the phylogenetic relationships of this botanical group. The results strengthen our knowledge of plastid characteristics and molecular evolution within the order Sapindales and provide new insights into the evolutionary relationships within the family Simaroubaceae.

Results

Characterization of the chloroplast genomes

The chloroplast genomes of S. versicolor, S. amara, and S. glauca are 159,693 bp, 159,906 bp, and 160,294 bp long, respectively, and exhibit the typical quadripartite circular structure: an LSC (87,419 bp, 87,692 bp, and 88,077 bp) and an SSC (17,516 bp, 17,470 bp, and 17,525 bp) separated by two IR regions (27,379 bp, 27,372 bp, and 27,346 bp each) (Fig. 1).
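As a quick consistency check on the region sizes reported above, the total plastome length should equal LSC + SSC + 2 × IR. The short sketch below recomputes the totals from the published values, assuming that each IR figure refers to a single IR copy; the function and variable names are illustrative only.

```python
# Region lengths (bp) as reported in the text; "IR" is taken to be one IR copy.
plastomes = {
    "S. versicolor": {"LSC": 87_419, "SSC": 17_516, "IR": 27_379, "total": 159_693},
    "S. amara": {"LSC": 87_692, "SSC": 17_470, "IR": 27_372, "total": 159_906},
    "S. glauca": {"LSC": 88_077, "SSC": 17_525, "IR": 27_346, "total": 160_294},
}


def quadripartite_length(lsc, ssc, ir):
    """Length of a circular plastome made of LSC, SSC and two IR copies."""
    return lsc + ssc + 2 * ir


# Recompute each total and compare it with the reported genome size.
computed = {
    species: quadripartite_length(r["LSC"], r["SSC"], r["IR"])
    for species, r in plastomes.items()
}

for species, length in computed.items():
    reported = plastomes[species]["total"]
    status = "matches" if length == reported else "differs from"
    print(f"{species}: {length} bp {status} the reported {reported} bp")
```

The same check can be applied to any annotated plastome to catch region coordinates that were transcribed incorrectly.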

Fig. 1

Circular genome maps of the plastomes of S. versicolor, S. amara, and S. glauca, showing genome sizes, LSC, IR and SSC regions. Gene classes are grouped by colors as indicated in the legend at the top and left of the image.

S. amara presented 131 genes (86 CDS, 37 tRNA and 8 rRNA), while S. versicolor and S. glauca presented 132 genes (87 CDS, 37 tRNA and 8 rRNA), similar to other species of the family Simaroubaceae. The genes were categorized according to their functions (Table 1). The slight difference in gene number among the species was due to the pseudogenization of the psbC gene in S. amara.

Table 1 Functional categorization of the genes present in the chloroplast genomes of S. amara, S. versicolor and S. glauca. (× 2) duplicated genes, * one intron, ** two introns, ψ pseudogene. Pseudogenization of the psbC gene occurred only in S. amara.

In all three species, a total of 19 duplicated genes are located within the Inverted Repeats (IRs). Among them, eight are CDS, seven are tRNA genes, and four are rRNA genes. Additionally, 18 genes contain introns, and two genes, ycf3 and clpP, possess two introns each. The rps12 gene was trans-spliced, exhibiting three exons and one intron (Figure S4. Supplementary material).

We considered as pseudogenes one of the copies of rpl22 (Ψrpl22, LSC/IRa boundary) and of ycf1 (Ψycf1, IRb/SSC boundary), as well as ΨpsbC (only in S. amara) and ΨinfA. Despite the presence of start and stop codons in the pseudogenized copies Ψrpl22 and Ψycf1, both exhibited notable nucleotide and amino acid substitution rates, as well as insertion/deletion events (Figure S5). These variations contributed to the size discrepancies observed across species. In addition, Ψrpl22 (LSC/IRa) is absent in Ailanthus altissima (Mill.) Swingle, another Simaroubaceae species, native to China and cultivated in other countries, mainly in temperate zones (Fig. 3).

The ΨpsbC gene in S. amara was reduced in size from 1422 bp to 586 bp. This reduction resulted from a premature stop codon introduced by the deletion of a single base at position 582 within the gene, which shifted the reading frame (see Supplementary Figures S1, S2 and S3 for details). Alternative start codons, GTG and ACG, were found in the rps19 and ndhD genes, respectively.
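The mechanism described above, in which a single-base deletion shifts the reading frame and brings a stop codon into frame, can be illustrated with a toy sequence. The 18-nt sequence below is hypothetical and is not the psbC sequence; it only shows how such a truncation can be detected.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def first_stop(cds):
    """Return the 0-based nucleotide index of the first in-frame stop codon, or None."""
    for i in range(0, len(cds) - 2, 3):
        if cds[i:i + 3] in STOP_CODONS:
            return i
    return None


# Hypothetical reading frame: ATG GCA TTG ACC GGT TAA (stop only at the end).
intact = "ATGGCATTGACCGGTTAA"

# Deleting the base at index 5 shifts the frame: ATG GCT TGA ... (early stop).
deleted_index = 5
mutant = intact[:deleted_index] + intact[deleted_index + 1:]

print("intact stop at nucleotide", first_stop(intact))
print("mutant stop at nucleotide", first_stop(mutant))
```

Applied to an aligned pair of functional and pseudogenized copies, the same scan reports where the open reading frame ends in each copy.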

Benchmarking

The cpDNAs of the 17 species were systematically compared in terms of size and gene count (Table S1). The cpDNA size ranged from 157,434 bp (Ruta graveolens) to 161,172 bp (Clausena excavata). The LSC region ranged from 85,387 bp (R. graveolens) to 88,382 bp (A. altissima), while the IRs spanned from 2868 bp (R. graveolens) to 27,923 bp (B. javanica), and the SSC region ranged from 17,470 bp (S. amara) to 18,709 bp (Melia azedarach). The total number of genes varied between 129 (Cedrela odorata) and 134 (C. excavata), with the number of tRNA genes ranging from 36 (C. odorata) to 39 (C. excavata).

Long and short repeat regions

The predominant classes of long repeat sequences were identified as palindromic and forward, constituting 52.4% and 39.3%, respectively. S. amara and Ailanthus excelsa exhibited an absence of reverse and complementary repeats (Fig. 2a). In 11 out of the 17 surveyed species, the majority of long repeats were localized within the IRs (Fig. 2b). The total number of microsatellite repeats ranged from 256 in Leitneria floridana to 185 in R. graveolens. The mononucleotide repeat was the most frequent (Fig. 2c) and the penta and hexanucleotide repeats were absent in some species. The LSC region harbored the largest number of microsatellite repeats, followed by the SSC and IR regions (Fig. 2d). The most frequent motif types were those rich in A/T, such as the repetition of the mononucleotide type A/T and the di-, tri- and tetranucleotide motifs, AT/AT, AAT/ATT and AAAT/ATTT (Supplementary Table S2).

Fig. 2

(a) Types of long repeats (forward, reverse, complementary, palindromic). (b) Distribution of long repeats in the SC and IR regions. (c) Types and total number of microsatellites. (d) Distribution of microsatellites in the SC and IR regions. The purple, green, and yellow highlights are the species of the families Simaroubaceae, Rutaceae, and Meliaceae, respectively.

Boundary regions and genomic rearrangements

The chloroplast genomes of the 17 species exhibited slight differences in the SC/IR boundary regions (Fig. 3). The rpl22 gene is located at the LSC/IRb boundary in most species, with a size between 396 and 489 bp, and extends into the IR by 6 to 322 bp.

Fig. 3

Boundaries between regions of the chloroplast genomes of the 17 species. LSC-IRb: boundary between large single copy (LSC in gray) and inverted repeat B (IRb in yellow). IRb-SSC: boundary between inverted repeat B and small single copy (SSC in green). SSC-IRa: boundary between SSC and inverted repeat A (in yellow). IRa-LSC: boundary between inverted repeat A and LSC. The purple, green, and yellow highlights are the species of the families Simaroubaceae, Rutaceae, and Meliaceae, respectively. ψ-pseudogenes (rpl22 and ycf1).

At the IR/LSC border, Ψrpl22 exhibited a size range of 177–429 bp and was treated as a pseudogene (ψ) in this study. In A. altissima, Ψrpl22 was absent, while in Brucea javanica and C. excavata it was fully localized within the IRs, measuring 486 bp and 177 bp, respectively (Fig. 3). At the IRb/SSC boundary, Ψycf1 ranged in size from 1083 bp to 1488 bp, while at the SSC/IRa boundary ycf1 ranged from 5343 bp to 5697 bp.

Progressive alignment using MAUVE revealed three locally collinear blocks (LCBs) (Fig. 4). The LCBs harboring tRNA, rRNA and CDS genes showed no genomic rearrangements in the plastomes; consequently, the position, direction, and order of the genes were preserved. We also highlight a collinear block (purple block) shared only by the species of the genus Simarouba, located in the ycf1 gene (SSC/IRa).

Fig. 4

Progressive alignment of the 17 chloroplast genomes representing the order Sapindales, including S. amara, S. versicolor, and S. glauca, performed using the MAUVE tool v.20150226. Locally collinear blocks (LCBs) in each species are represented as green, purple, and pink blocks connected by lines. The rRNA, tRNA, and CDS annotations are displayed in red, green, and white, respectively, below the collinear blocks.

Nucleotide diversity (π), selection signal, and codon usage bias

The genomic regions exhibiting the highest nucleotide diversity (π) across the complete genomes were identified mainly within the rps16, ndhF, matK, rpl32, and ycf1 genes, with values exceeding twice the median (Fig. 5). Hotspots of nucleotide diversity in intergenic regions were identified when the species of the family Simaroubaceae and of the genus Simarouba were examined (Fig. 5b, c). Notably, a nucleotide diversity hotspot in the intergenic region petA-psbJ persists across both analyses. At all three levels, order Sapindales, family Simaroubaceae, and genus Simarouba, regions of the ndhF and ycf1 genes showed a nucleotide diversity hotspot (Fig. 5a–c). The regions exhibiting low nucleotide diversity (π = 0.01) correspond to ribosomal RNA genes.
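The hotspot criterion used here, a π value above twice the median, is simple to apply once per-window π values are available. The sketch below shows one way to do so; the window labels and π values are invented for illustration and are not results from this study.

```python
from statistics import median


def flag_hotspots(pi_by_window, factor=2.0):
    """Return the threshold (factor x median) and the windows whose pi exceeds it."""
    threshold = factor * median(pi_by_window.values())
    flagged = sorted(name for name, pi in pi_by_window.items() if pi > threshold)
    return threshold, flagged


# Hypothetical per-window values, keyed by an arbitrary window label.
toy_pi = {"win_1": 0.01, "win_2": 0.02, "win_3": 0.03, "win_4": 0.09, "win_5": 0.12}

threshold, hotspots = flag_hotspots(toy_pi)
print(f"threshold = {threshold:.3f}; hotspots = {hotspots}")
```

Using the median rather than the mean keeps the threshold from being inflated by the hotspots themselves.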

Fig. 5

Nucleotide diversity (π) considering (a) the complete genomes of the seventeen species; (b) the complete genomes of the nine species of the family Simaroubaceae; (c) the complete genomes of the three species of the genus Simarouba. The blue highlighted π peaks indicate intergenic regions. The red horizontal line corresponds to twice the median of the π values.

The analysis of selection signals, as indicated by the dN/dS ratio (ω), revealed that the ribosomal protein large gene, rpl23, exhibited a dN/dS ratio of 2.33, suggesting positive selection (ω > 1) acting on this gene (Fig. 6). In contrast, other CDS demonstrated signatures of negative selection (ω < 1). This observation indicates that natural selection has preserved the amino acid sequence of proteins encoded by these genes. It is noteworthy, however, that genes responsible for encoding other ribosomal proteins and proteins constituting photosystems exhibited minimal changes in their nucleotide compositions.

Fig. 6

Selection signal analysis in 77 CDS, dN/dS ratio in the seventeen species analyzed. The CDS were grouped into functional categories and ordered in descending order of the dN/dS values. The dashed horizontal line indicates dN/dS ratio = 1.

The prevalent amino acids in S. amara, S. versicolor and S. glauca were leucine, isoleucine and serine. Codon usage bias, measured by the Relative Synonymous Codon Usage (RSCU), was observed across a majority of amino acids. In S. amara the codons with the highest RSCU values were UUA-Leu, AGA-Arg, in S. versicolor AGA-Arg, GUU-Val, and in S. glauca, AGA-Arg and UUG-Leu (Figure S6). It should be noted that approximately 30 codons exhibited RSCU values greater than one (RSCU > 1), and most of them terminated with base U or A.
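For readers unfamiliar with the index, RSCU is the observed count of a codon divided by the count expected if all synonymous codons for that amino acid were used equally, so RSCU > 1 indicates a preferred codon. The sketch below computes it for two six-fold degenerate amino acids only; the codon counts are hypothetical, and the study itself obtained RSCU values with MEGA X.

```python
# Synonymous codon families (standard genetic code) for two amino acids only;
# a complete analysis would include every degenerate amino acid.
FAMILIES = {
    "Leu": ["UUA", "UUG", "CUU", "CUC", "CUA", "CUG"],
    "Arg": ["CGU", "CGC", "CGA", "CGG", "AGA", "AGG"],
}


def rscu(codon_counts, families=FAMILIES):
    """RSCU = observed count / (family total / number of synonymous codons)."""
    values = {}
    for codons in families.values():
        total = sum(codon_counts.get(c, 0) for c in codons)
        if total == 0:
            continue  # amino acid absent: RSCU undefined
        expected = total / len(codons)
        for codon in codons:
            values[codon] = codon_counts.get(codon, 0) / expected
    return values


# Hypothetical codon counts, not taken from the plastomes analyzed here.
toy_counts = {
    "UUA": 30, "UUG": 18, "CUU": 18, "CUC": 6, "CUA": 12, "CUG": 6,
    "AGA": 12, "CGU": 6, "CGA": 6,
}

toy_rscu = rscu(toy_counts)
preferred = sorted(c for c, v in toy_rscu.items() if v > 1)
print(preferred)
```

Within each family the RSCU values sum to the number of synonymous codons, which is a convenient sanity check on any RSCU table.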

Phylogenetic analyses

Phylogenetic relationships consistently recovered three distinct groups corresponding to the families Simaroubaceae, Rutaceae, and Meliaceae, in addition to the outgroup, Sapindaceae (Fig. 7). Within the genus Simarouba, S. versicolor is phylogenetically closer to S. glauca than to S. amara (Fig. 7).

Fig. 7

Phylogenetic tree of 19 species within the order Sapindales, reconstructed from 78 CDSs by the maximum likelihood method under the TVM + F + I + G4 substitution model, which was selected by the Bayesian information criterion (BIC). Values at the nodes are bootstrap support values. The colored shades represent the analyzed families, and the vertical bars to the right correspond to the botanical tribes or subfamilies. Symbols, as given in the legend, denote missing and pseudogenized genes.

Discussion

The chloroplast genomes of S. amara, S. versicolor, and S. glauca exhibited a circular, quadripartite structure, characterized by two copies of the inverted repeat (IR) region separated by single-copy regions (LSC and SSC). The size, gene order, and content of these plastomes were similar to those observed in other species within the family Simaroubaceae31,32,33 and across the order Sapindales13,34,35. Although electron microscopy-based analyses have revealed cpDNAs in linear, multi-branched structures in certain angiosperm species3, the predominant configuration of angiosperm cpDNAs is circular, and the majority of these circular molecules range in size from 135 to 160 kb 36,37.

A slight variation in the number of CDS and tRNA genes and in GC content was observed within the group (Table S1). The variation in the number of CDS and tRNA genes observed in our analyses was due to at least one of the following factors: (i) expansion and contraction of IR regions resulting in duplication and pseudogenization; (ii) pseudogenization of the psbC gene. The hypothesis of plastid gene transfer to the nucleus has been invoked to explain the pseudogenization or absence of genes in the plastid genome, such as the rpl22 gene in the genera Passiflora, Castanea, Prunus, and Theobroma 38,39 and the accD gene in Primulaceae 40. However, few characterized functional genes have been transferred in angiosperms 39, and experiments are needed to search for functional copies of these genes in the nuclear genome (for review, see Millen et al., 2001 41 and Ueda et al., 2007 42). Daniell et al. (2016)43 add that once a gene is transferred to the nucleus, it must acquire sequences that regulate its transcription, as well as a signal that directs the resulting peptide from the cytoplasm to the chloroplast.

In S. amara, we observed pseudogenization of psbC, a gene of the photosystem II (PSII) complex that encodes the CP43 protein. CP43 plays a role in energy transfer from the outer antenna complex to the reaction center, contributing to light harvesting and PSII stabilization and acting as a tertiary electron donor and acceptor44,45,46. Intergenic regions proximal to the psbC gene show elevated nucleotide diversity, as identified by Jo et al., 2019 47, Maurya et al., 2023 48, and Xu et al., 2023 49. Additionally, the gene has shown a positive selection signal, indicating nucleotide and amino acid sequence changes, in selected angiosperm species 50,51. It is worth noting that the CP47 protein encoded by the psbB gene performs functions similar to those of the CP43 protein52,53,54.

The use of alternative start codons, specifically GUG in the rps19 gene and ACG in the ndhD gene, was identified in both S. amara and S. versicolor. Our data corroborate studies in species of the same family31,33,55 and in other angiosperm families such as Passifloraceae, Arecaceae and Apocynaceae56,57,58. In both prokaryotes and eukaryotes, some genes are known to be initiated with non-AUG codons derived from a single base substitution59, such as the GUG-Val, UUG-Leu, and AUU-Ile codons in Escherichia coli, and the ACG-Thr codon in adeno-associated and Sendai viruses60,61.

The trans-splicing pattern observed in the rps12 (ribosomal protein) gene and the loss of its intron in the species analyzed have been documented in other taxonomic groups, encompassing both angiosperms and gymnosperms62,63. This bipartite gene is encoded at two locations in the plastid genome, giving rise to two mRNA precursors that subsequently undergo trans-splicing to form the complete functional transcript, a molecular phenomenon known as exon shuffling (for more information see Long et al., 2003 64). The occurrence of trans-splicing in the rps12 gene was first documented in Nicotiana tabacum65,66.

Across all 17 species examined, palindromic and direct (forward) repeats were the predominant types of long repeats. Comparable findings have been reported in species belonging to other taxonomic groups, such as Malpighiales 67,68 and Fabales 6,69. The microsatellite data revealed numerous SSR sites in the analyzed plastid genomes. These data are a valuable resource for subsequent analyses of polymorphisms in plastid microsatellite regions, facilitating investigations into evolutionary dynamics, genetic diversity, and population structure70,71,72. Furthermore, they can help to delineate the conservation status of the species within their geographic range73,74.

Analysis of the boundary regions revealed that the rpl22 and ycf1 genes are situated at the single copy/IR boundaries in the majority of species. Variations in size were observed in both pseudogenized and non-pseudogenized copies of these genes (Fig. 3). Similar findings have been documented in other taxa, including representatives of the family Sapindaceae75,76, also in the order Sapindales, and of the family Fabaceae77. In these taxa, pseudogenization events involving one of the copies of the rpl22 and ycf1 genes have been reported. The authors attributed these occurrences to partial gene duplication and to expansion and contraction of the IRs. These molecular dynamics likely contributed to the reduction in the size of the gene copies, exemplified in our study by ycf1, of which one copy ranged from 1092 to 1600 bp and the other measured approximately 5300 bp. In the boundary regions of plastid genomes, episodes of IR contraction and expansion have been frequently reported in angiosperms78,79,80. These dynamic events contribute to the observed variation in the size of chloroplast genomes and of genes located at or near the region boundaries, as reported by Jansen & Ruhlman (2012)39 and Dobrogojski et al. (2020)36.

Genomic rearrangement data support the notion that both the order and orientation of genes within angiosperm chloroplast genomes have been relatively conserved across evolutionary history81,82,83, and high collinearity has been observed in Sapindales. While instances of genomic rearrangement events have been documented84,85,86, some cases have been attributed to the loss or reduction of inverted repeats (IRs)87,88,89. Palmer (1983)90 and Mower & Vickrey (2018)3 suggest that the presence of IRs promotes the stabilization of plastome structure, perhaps imposing structural constraints and mitigating major genomic rearrangements.

Our findings indicate the presence of nucleotide diversity hotspots within the rpl32, ycf1 and matK genes and in the intergenic region petA-psbJ across a majority of interspecific group comparisons (Fig. 5). The rpl32 gene encodes a protein component of the large ribosomal subunit91. The ycf1 gene encodes a protein integral to the Translocon on the Inner Chloroplast membrane (TIC) complex located in the inner membrane of the chloroplast92,93. Lastly, the matK gene encodes a putative maturase responsible for catalyzing the removal of introns from premature RNAs94. Nucleotide diversity hotspots within the rpl32, ycf1, and matK genes have been extensively documented in various taxonomic groups95,96,97. Recent research supports their utility in species identification (DNA barcoding)98,99,100 and in the study of phylogenetic relationships101,102. The intergenic region petA-psbJ has been found to exhibit elevated nucleotide substitution rates in species belonging to the order Sapindales103. Species of the genus Simarouba exhibit a parapatric distribution, and challenges in species identification have been documented, particularly the frequent confusion of S. glauca and S. versicolor with S. amara reported by Franceschinelli et al. (1999)22. By analyzing the nucleotide diversity of the genus (Fig. 5c), we identified 13 nucleotide diversity hotspots. These regions hold promise for molecular differentiation among species within the genus Simarouba (DNA barcoding), for the management and conservation of genomic resources, and for the analysis of phylogenetic relationships.

The rpl23 gene is under positive selective pressure in the species examined in this study and is located in the IR region, belonging to a gene family responsible for encoding proteins that contribute to the structural composition of the large ribosomal subunit. Hypervariable regions and signals of positive selection have been found in the rpl23 gene across the Celastraceae, Styracaceae, and Fabaceae families104,105,106. In addition, instances of rpl23 gene loss and pseudogenization have been documented in species belonging to the Podostemaceae107, Lauraceae108, Araceae109, and Hypericaceae86 families. Therefore, our results, combined with other evidence for positive selection on the rpl23 gene, call for further evolutionary studies of this gene within the order Sapindales.

The preferential use of amino acids identified in our investigation, particularly Leu, Ile, and Ser, showed patterns similar to those of other angiosperms4,110,111. In prokaryotes, branched-chain amino acids, including Leu and Ile, are involved in protein synthesis and maintenance of metabolic processes112,113. In addition, Ile is often found in beta sheets (β-sheets), while Leu contributes to the formation of α-helices, loops, and leucine zippers114. In the three species studied, S. amara, S. versicolor and S. glauca, AGA-Arg and UUA-Leu were the most frequently used codons (Figure S6). Our findings are consistent with previous investigations of codon usage bias in Rutaceae115 and other families of the order Sapindales116,117. According to Prosdocimi & Ortega (2007)118, the amino acids arginine and leucine have codons responsible for maintaining protein stability against DNA mutations. Moreover, factors such as translation optimization, high gene expression rates, the nucleotide composition of the genome, and the less stringent matching at the first base of the tRNA anticodon (wobble base pairing) can influence the preferential use of specific codons119,120,121,122,123.

Phylogenetic analysis reveals that S. versicolor is more closely related to S. glauca than to S. amara. The distinct geographic distributions and vegetation preferences of these three continental species warrant emphasis. S. versicolor, native to South America, and S. glauca, found in Central America, are frequently found in dry forests and savannas. In contrast, S. amara, distributed throughout Central and South America, is usually found in riparian and ombrophilous forests21,22,23. Both S. versicolor and S. glauca occupy drier vegetation types than S. amara, raising the possibility that the tropical forests of Panama and South America, particularly the Amazon region, act as an ecological barrier that influences the geographical separation of these two species. This observation supports the hypothesis that S. glauca and S. versicolor may share a common evolutionary ancestor, or alternatively, that S. versicolor may have evolved from S. glauca or vice versa21. The families Simaroubaceae, Meliaceae, and Rutaceae form a polytomy, characterized by unresolved phylogenetic relationships, according to APG (2006)9. This finding is consistent with recent chloroplast genome studies of the group conducted by Liu et al. (2021)124, Majure et al. (2021)19 and Yang et al. (2022)76. Therefore, the phylogenetic insights presented in this study not only contribute to the discourse on the evolutionary relationships of S. amara, S. versicolor, and S. glauca within the family Simaroubaceae, but also enrich the discussion on the evolutionary relationships within the order Sapindales.

In this study, the chloroplast genomes of S. amara, S. versicolor, and S. glauca were successfully assembled and characterized. The gene order (collinearity) and structure closely resembled those observed in chloroplast genomes of other species within the family Simaroubaceae and the order Sapindales, with no genomic rearrangements. Pseudogenization of a gene important for the photosynthetic pathway was detected. In addition, the investigation revealed highly divergent genomic regions, which are candidates for the development of DNA barcode markers, particularly within the genus Simarouba. We found that the majority of the CDSs are under negative selection and described the codon and amino acid usage preferences in the chloroplast genomes of S. amara, S. versicolor and S. glauca. Phylogenetic analyses based on plastome data elucidated evolutionary relationships, positioning S. versicolor closer to S. glauca than to S. amara. Furthermore, Simaroubaceae exhibited a close evolutionary relationship with Rutaceae. This work not only contributes novel genomic resources for the family Simaroubaceae, but also advances comparative genomics and the understanding of evolutionary dynamics within the plastomes of Sapindales representatives. The findings also inform discussions of phylogenetic relationships within the genus Simarouba and the order Sapindales.

Methods

Sequencing, assembly, and characterization of the chloroplast genome

Fresh leaves of S. amara were collected in the Serra of Pirineus, city of Pirenópolis, and of S. versicolor in the city of Goiânia, both in the state of Goiás, Brazil (geographical coordinates: −48.84820, −15.80294 and −50.15408289, −15.92941545, respectively). Total DNA was isolated using the 2% CTAB protocol 125. Libraries were prepared using the Illumina DNA Prep Kit and sequenced on the Illumina MiSeq platform in paired-end mode, using the V3 (600 cycles) and V2 (300 cycles) kits for S. amara and S. versicolor, respectively. Raw reads of S. glauca and Ailanthus excelsa were obtained from databases. Raw reads were subjected to quality control using the Trimmomatic software 126. For S. amara reads, the parameters used were SLIDINGWINDOW:4:20, CROP:289, HEADCROP:15, ILLUMINACLIP:NexteraPE-PE.fa:2:15:10, LEADING:10, TRAILING:10, MINLEN:100. For S. versicolor, S. glauca, and A. excelsa, the following parameters were applied: SLIDINGWINDOW:4:15, CROP:150, HEADCROP:15, ILLUMINACLIP:NexteraPE-PE.fa:2:15:10, LEADING:10, TRAILING:10, MINLEN:100.
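The two trimming parameter sets above can be written out as Trimmomatic paired-end command lines. The sketch below only assembles the argument lists and does not run anything; the jar path, the read file names, and the output prefix are hypothetical placeholders. Trimmomatic applies steps in the order given on the command line, and because the text does not state the order that was used, the listed order is kept here as an assumption.

```python
# Trimming steps as listed in the text (order kept as listed; see note above).
TRIM_STEPS = {
    "S. amara": [
        "SLIDINGWINDOW:4:20", "CROP:289", "HEADCROP:15",
        "ILLUMINACLIP:NexteraPE-PE.fa:2:15:10",
        "LEADING:10", "TRAILING:10", "MINLEN:100",
    ],
    "others": [  # S. versicolor, S. glauca and A. excelsa
        "SLIDINGWINDOW:4:15", "CROP:150", "HEADCROP:15",
        "ILLUMINACLIP:NexteraPE-PE.fa:2:15:10",
        "LEADING:10", "TRAILING:10", "MINLEN:100",
    ],
}


def trimmomatic_pe_command(r1, r2, prefix, steps, jar="trimmomatic.jar"):
    """Build a paired-end Trimmomatic argument list (jar path is a placeholder)."""
    # Output order expected in PE mode: forward paired, forward unpaired,
    # reverse paired, reverse unpaired.
    outputs = [
        f"{prefix}_{mate}{kind}.fastq.gz" for mate in ("1", "2") for kind in ("P", "U")
    ]
    return ["java", "-jar", jar, "PE", r1, r2, *outputs, *steps]


# Hypothetical input file names.
cmd = trimmomatic_pe_command(
    "amara_R1.fastq.gz", "amara_R2.fastq.gz", "amara_trimmed", TRIM_STEPS["S. amara"]
)
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where Java and Trimmomatic are installed.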

De novo assemblies of the chloroplast genomes were performed using the software NOVOPlasty v. 4.2.1 127, using the rbcL gene of S. amara as seed (accession number EU043036.1) for the three Simarouba species and the rbcL gene of A. altissima (accession number EU043036.1) for A. excelsa. The genes were predicted and annotated using the GeSeq tool 128 on the MPI-MP CHLOROBOX online platform. The annotations were manually checked for the correct delimitation of the single copy (LSC and SSC) and inverted repeat (IR) regions, as well as the start and stop codons of the CDS. This step was performed using Geneious Prime v. 2021.1.1 129 and Ugene v.48.1 130. After manual inspection and curation of the annotations, circular maps of the chloroplast genomes were drawn using OGDRAW v.1.3.1 131.

Benchmarking

For comparative analyses, we used cpDNAs from the species assembled in this work, S. amara, S. versicolor, S. glauca and A. excelsa, and 13 other species from three families of the order Sapindales, for a total of 17 species (Table S1). Note that the plastomes of all species were annotated with the same parameters for comparison purposes.

Repetitive regions

Long repeats were detected in the plastomes using the online version of REPuter132, with a Hamming distance of 3 and a minimum repeat size of 20 bp. Simple sequence repeats (SSRs) were identified using MISA Web Version133 with the following minimum numbers of repeat units: ≥ 8 for mononucleotides, ≥ 5 for dinucleotides, ≥ 4 for trinucleotides, and ≥ 3 for tetra-, penta- and hexanucleotides. Two microsatellites separated by at most 100 bp were recorded as a composite microsatellite. The distribution of microsatellite patterns was compared and analyzed graphically.
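
To make the SSR thresholds concrete, the sketch below scans a sequence with regular expressions using the same minimum repeat counts and the 100 bp limit for composite microsatellites. It is a simplified illustration and not a reimplementation of MISA: the function names are hypothetical, and edge cases such as overlapping motifs may be handled differently by the actual tool.

```python
import re

# Minimum number of repeat units per motif length (values from the text).
MIN_REPEATS = {1: 8, 2: 5, 3: 4, 4: 3, 5: 3, 6: 3}


def is_primitive(motif):
    """True if the motif is not itself a repetition of a shorter unit."""
    n = len(motif)
    return not any(
        n % k == 0 and motif[:k] * (n // k) == motif for k in range(1, n)
    )


def find_ssrs(seq, min_repeats=MIN_REPEATS):
    """Return sorted (start, end, motif, n_units) tuples; end is exclusive."""
    seq = seq.upper()
    hits = []
    for k, n in sorted(min_repeats.items()):
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, n - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            # Skip e.g. "AA" runs, which are already reported as mononucleotides.
            if is_primitive(motif):
                hits.append((m.start(), m.end(), motif, (m.end() - m.start()) // k))
    return sorted(hits)


def group_composite(hits, max_gap=100):
    """Group SSRs whose distance to the previous SSR is at most max_gap bp."""
    groups = []
    for hit in hits:
        if groups and hit[0] - groups[-1][-1][1] <= max_gap:
            groups[-1].append(hit)
        else:
            groups.append([hit])
    return groups


example = "CGC" + "A" * 8 + "CGC" + "AT" * 5 + "GG"
ssrs = find_ssrs(example)
print(ssrs)
print(group_composite(ssrs))
```

Groups containing more than one SSR correspond to composite microsatellites under the 100 bp criterion.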

Boundary regions and genomic rearrangements

After annotation and curation of the plastomes of all species included in this study, information on the gene composition of the boundaries of IRs and SCs was obtained using Geneious Prime v.2023.1.2. To determine the presence of genomic rearrangements, we used the tool Mauve v.2.4.0134, whose analyses were performed from a multiple alignment of the 17 genomes.

Nucleotide diversity, selection signal and codon usage bias

The metric of nucleotide diversity (π) was proposed to quantify the degree of polymorphism among populations or species135. It estimates the number of nucleotide substitutions per site obtained by comparing sequences from different individuals or species. Nucleotide diversity (π) was calculated to identify mutational hotspots. The analyses were performed on three groups: the first included all 17 species of the three families, the second included only the 9 species of the family Simaroubaceae, and the third included only the genus Simarouba. The complete plastomes were aligned using MAFFT v.7136, and nucleotide diversity was calculated using DnaSP137 with a sliding window of 600 bp and a step size of 200 bp. The generated data were plotted using the R environment138.
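
The sliding-window calculation can be sketched as follows. This is a minimal illustration of π as the average proportion of pairwise differences per site, using the 600 bp window and 200 bp step given above. It assumes that alignment columns containing gaps or ambiguous bases are skipped, and it does not reproduce the exact site handling of DnaSP; the function names are hypothetical.

```python
from itertools import combinations


def nucleotide_diversity(seqs):
    """Average pairwise differences per site over columns with only A/C/G/T."""
    seqs = [s.upper() for s in seqs]
    n = len(seqs)
    cols = [c for c in zip(*seqs) if all(b in "ACGT" for b in c)]
    if n < 2 or not cols:
        return 0.0
    n_pairs = n * (n - 1) / 2
    diffs = sum(
        sum(a != b for a, b in combinations(col, 2)) for col in cols
    )
    return diffs / (n_pairs * len(cols))


def sliding_pi(seqs, window=600, step=200):
    """Return (window midpoint, pi) pairs along an alignment."""
    length = len(seqs[0])
    results = []
    for start in range(0, max(length - window, 0) + 1, step):
        chunk = [s[start:start + window] for s in seqs]
        results.append((start + window // 2, nucleotide_diversity(chunk)))
    return results


toy = ["ACGT", "ACGA", "ACGT"]
print(nucleotide_diversity(toy))
print(sliding_pi(toy, window=2, step=1))
```

Plotting the midpoints against π gives the kind of profile in which peaks mark candidate mutational hotspots.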

Selection signals in species of the order Sapindales were assessed from nonsynonymous (dN) and synonymous (dS) substitutions through the value of omega (ω) = dN/dS, which was obtained using 77 CDS. We aligned the CDS using MAFFT v.7136 and then estimated ω using the Codeml tool included in the PAML 4.9 program139. We first estimated the value of ω under two neutral models, one using species trees (runmode = 0; model = 0; NSsites = 0) and one using no tree information (nullmode = 0; NSsites = 0). We then contrasted pairs of models (neutral versus positive selection): M2a (positive selection model; model = 0; NSsites = 2) versus M1a (neutral model; model = 0; NSsites = 1), and M8 (positive selection under beta distribution; model = 0; NSsites = 8) versus M7 (neutral model under beta distribution; model = 0; NSsites = 7). The p-value and FDR (false discovery rate) were calculated for each contrast and for each gene (Supplementary Table S2). Values of ω > 1 indicate positive selection, ω = 1 indicates neutral evolution and ω < 1 indicates negative selection (Jeffares et al., 2015). To identify codon usage bias, the RSCU index (Relative Synonymous Codon Usage)140 was obtained using MEGA X v.10.2.2141, and the results were plotted as a heat map using the R environment138 (pheatmap package v.1.0.12).
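
The model contrasts can be illustrated with a likelihood ratio test. The sketch below assumes that each contrast (M2a versus M1a, M8 versus M7) is tested against a chi-square distribution with 2 degrees of freedom, a common convention for these nested pairs, and that the FDR is obtained with the Benjamini-Hochberg procedure. Neither choice is stated in the text, and the log-likelihood values in the example are invented for illustration.

```python
import math


def lrt_pvalue_df2(lnl_null, lnl_alt):
    """P-value of a likelihood ratio test with 2 degrees of freedom.

    For df = 2 the chi-square survival function is exactly exp(-x / 2).
    """
    stat = max(0.0, 2.0 * (lnl_alt - lnl_null))
    return math.exp(-stat / 2.0)


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical log-likelihoods (null model, alternative model) per gene.
lnl = {"geneA": (-1000.0, -997.0), "geneB": (-2500.0, -2499.9)}
pvals = [lrt_pvalue_df2(null, alt) for null, alt in lnl.values()]
for gene, p, q in zip(lnl, pvals, benjamini_hochberg(pvals)):
    print(gene, round(p, 4), round(q, 4))
```

A gene would be flagged as a candidate for positive selection when the adjusted value falls below the chosen threshold and the alternative model estimates a site class with ω > 1.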

Phylogenomics analyses

Phylogenomic analyses were performed using 78 CDSs present in the 17 species. The outgroup consisted of two species of the family Sapindaceae (Pometia tomentosa and Dodonaea viscosa). The CDSs were aligned using MAFFT v.7 and then concatenated into a single matrix using the catfasta2phyml software142. The Gblocks v.0.91b tool143 was used to identify conserved blocks, eliminate poorly aligned positions and divergent regions of non-homologous sites, and eliminate saturated substitutions. The phylogenetic tree was constructed using IQ-TREE multicore version 1.6.12144 under the TVM + F + I + G4 model, which ModelFinder selected as the best-fitting evolutionary model according to the Bayesian information criterion (BIC). The maximum likelihood method was used to assess the evolutionary relationships, with 1000 replicates.
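
The concatenation step can be sketched as follows. This is a minimal illustration of building a single matrix from per-gene alignments, and it is not the catfasta2phyml implementation; the function name and the toy data are hypothetical, and padding taxa that lack a gene with gaps is an assumption.

```python
def concatenate(alignments):
    """Concatenate per-gene alignments into one matrix.

    alignments: list of dicts mapping taxon name -> aligned sequence,
    one dict per gene. Taxa missing from a gene are padded with gaps.
    """
    taxa = sorted({taxon for aln in alignments for taxon in aln})
    parts = {taxon: [] for taxon in taxa}
    for aln in alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("sequences within an alignment must share one length")
        (length,) = lengths
        for taxon in taxa:
            parts[taxon].append(aln.get(taxon, "-" * length))
    return {taxon: "".join(chunks) for taxon, chunks in parts.items()}


gene1 = {"sp1": "ATG", "sp2": "ATA"}
gene2 = {"sp1": "CC"}
print(concatenate([gene1, gene2]))
```

Because every gene contributes a block of fixed width, block boundaries can be tracked alongside the matrix if a partitioned analysis is desired.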