A global survey of prokaryotic genomes reveals the eco-evolutionary pressures driving horizontal gene transfer

Dmitrijeva, Marija; Tackmann, Janko; Matias Rodrigues, João Frederico; Huerta-Cepas, Jaime; Coelho, Luis Pedro; von Mering, Christian

doi:10.1038/s41559-024-02357-0

A global survey of prokaryotic genomes reveals the eco-evolutionary pressures driving horizontal gene transfer

Article
Open access
Published: 05 March 2024

Volume 8, pages 986–998, (2024)
Cite this article

Download PDF

You have full access to this open access article

From

View current issue Submit your manuscript

A global survey of prokaryotic genomes reveals the eco-evolutionary pressures driving horizontal gene transfer

Download PDF

13k Accesses
5 Citations
69 Altmetric
Explore all metrics

Abstract

Horizontal gene transfer, the exchange of genetic material through means other than reproduction, is a fundamental force in prokaryotic genome evolution. Genomic persistence of horizontally transferred genes has been shown to be influenced by both ecological and evolutionary factors. However, there is limited availability of ecological information about species other than the habitats from which they were isolated, which has prevented a deeper exploration of ecological contributions to horizontal gene transfer. Here we focus on transfers detected through comparison of individual gene trees to the species tree, assessing the distribution of gene-exchanging prokaryotes across over a million environmental sequencing samples. By analysing detected horizontal gene transfer events, we show distinct functional profiles for recent versus old events. Although most genes transferred are part of the accessory genome, genes transferred earlier in evolution tend to be more ubiquitous within present-day species. We find that co-occurring, interacting and high-abundance species tend to exchange more genes. Finally, we show that host-associated specialist species are most likely to exchange genes with other host-associated specialist species, whereas species found across different habitats have similar gene exchange rates irrespective of their preferred habitat. Our study covers an unprecedented scale of integrated horizontal gene transfer and environmental information, highlighting broad eco-evolutionary trends.

Historical and Epistemological Perspectives on What Horizontal Gene Transfer Mechanisms Contribute to Our Understanding of Evolution

Migration and horizontal gene transfer divide microbial genomes into multiple niches

Article Open access 23 November 2015

Evolutionary change and phylogenetic relationships in light of horizontal gene transfer

Article 21 April 2015

Main

The gene content of microbial genomes constantly changes through gain and loss of genes¹. Gene gain through horizontal gene transfer (HGT) in particular is a driving force in prokaryotic genome evolution^1,2, and most genes have been shown to undergo HGT at least once in their evolutionary history^3,4. However, foreign genes can be a burden or even toxic to the recipient⁵, typically persisting only as long as is imposed by fluctuating environmental circumstances. In a simple two-class model of gene evolution⁶, such genes display high rates of turnover. In contrast, other foreign genes may provide sufficient benefit to the recipient, outweighing maintenance costs and persisting long enough to be detected in present-day genomes through computational methods⁷.

Multiple conceptually diverse approaches for computational HGT detection exist⁸. Detecting genomic regions with abnormal sequence composition has the advantage of requiring the recipient genome only. However, such detection is restricted to recent transfer events due to gene amelioration, whereby foreign DNA evolves to resemble that of its host species^9,10. Alternatively, HGT can be detected through comparing genomes and identifying discrepancies between gene and species evolutionary history. These comparative genomics approaches include the detection of nearly identical sequences in genomes from different species^{11,12,13,14,15,16} or the more computationally intensive modelling of gene evolution through processes such as gene duplication, transfer or loss^3,4,17,18,19. The next-generation sequencing revolution has enabled HGT detection through comparative genomics approaches by enabling an abundance of publicly available, high-quality prokaryotic genomes in curated databases such as proGenomes²⁰.

Previous large-scale surveys of HGT across different environments have showcased the contribution of shared ecology to HGT^11,13,14,21. Generally, inter-environmental transfers were found to be rare, with the possible exception of antibiotic resistance genes¹¹. The importance of shared ecology in determining HGT frequency can be explained from two different perspectives. On the one hand, similar environments may exert similar pressures, prioritizing the persistence of specific functional traits. On the other hand, as most HGT mechanisms require physical proximity between the donor and the recipient²², co-occurring within the same environment may simply provide more opportunities for HGT.

In this study, we aim to elucidate both ecological and evolutionary factors that contribute to a successful gene gain event through HGT. Using the gene content of 8,790 species’ pangenomes²⁰ clustered into over a million gene families, we ran RANGER-DTL to model duplication, transfer and loss events in gene evolution²³. In parallel, we searched for these species in the MicrobeAtlas database (https://microbeatlas.org/), obtaining more than one million microbial community profiles from diverse, globally distributed environments. By following species presence and abundance profiles across this dataset, we show that co-occurrence, abundance and dispersal patterns all determine HGT success. By looking at functionality and ubiquity of transferred genes, we observe that recent transfers are enriched for genes involved in transcription, replication and repair, and in antimicrobial resistance genes. By comparison, old transfers are enriched for genes involved in amino acid, carbohydrate, and energy metabolism, and are more likely to concern genes that are present in nearly all members of a species. This study provides an overview of global ecological trends in HGT.

Results and discussion

Extensive contribution of HGT to prokaryote genome evolution

To detect HGT events, we first created pangenomes for 8,790 species based on 78,315 high-quality, single-isolate genomes²⁰. The resulting 41 million genes were clustered on minimum 80% nucleotide identity and minimum 50% sequence overlap into 22 million clusters, 961,821 (4.4%) of which covered more than 5 species. For each such gene cluster, reconciliation with the species tree based on 40 universal single-copy marker genes²⁰ was performed using RANGER-DTL²³ (Fig. 1 and Methods), resulting in 2.4 million well-supported unique transfer events that involved 8,756 species and 1.7 million species pairs (4.4% of all possible species pairs). Previous studies considering trends in HGT based on thousands of genomes have focused on transfers involving gene pairs with ≥99% nucleotide identity^11,13,15,16. Such gene pairs comprised 3.1% of detected events in our dataset (see right peripheral histograms in Fig. 2a and Extended Data Fig. 1a for the distribution of HGT events across gene distances). By using tree reconciliation for HGT event detection, we obtained an extended set of gene transfers that allowed us to compare whether transfers that happened earlier in evolution were subjected to the same trends as very recent transfer events. Nevertheless, we observed fewer transferred gene pairs with gene tree distances exceeding 0.6 (Fig. 2a and Extended Data Fig. 1a). Older transfers were more difficult to detect with high confidence and were thus less likely to pass our conservative thresholds for HGT event detection (Methods).

**Fig. 1: Global-scale computational detection of HGT events and dataset integration with relative abundance profiles from over a million environmental samples.**

**Fig. 2: Genes participating in HGT are mostly accessory and display distinct functional profiles depending on time passed since transfer.**

At least one transfer event was detected in 634,352 gene trees out of 961,821 (~66%). The fraction of transferred genes varied between species. For example, a transfer event was detected in 61.5% of the genes considered for Acinetobacter baumannii, but only in 19.8% for Listeria monocytogenes. Across all species, this resulted in an average of 42.5% (interquartile range, 35.9–50.5%) of genes per species affected by HGT. This number is lower than previously reported estimates of an average of 73% (ref. ¹⁷) and 81% (ref. ⁴) genes per genome affected by HGT. This discrepancy is probably because we use a stricter threshold to cluster sequences: 80% nucleotide identity as opposed to 30% (ref. ¹⁷) and 25% (ref. ⁴) amino acid identity. Therefore, we do not capture the oldest transfers considered in these studies^4,17 but we are able to assess HGT in a much larger dataset and look at more recent transfers.

We observed no association (r = 0.01, P_Pearson = 0.17) between the average fraction of transferred genes per species and the number of genomes used for generating the pangenome (Extended Data Fig. 2a). The average fraction of transferred genes was therefore not notably skewed towards better-studied species. Interestingly, the average fraction of transferred genes per species was weakly positively correlated (r = 0.18, P_Pearson = 7.0 × 10⁻⁶⁴) with the number of genes in the genome (Extended Data Fig. 2b). Previous studies comparing closely related prokaryotic genomes of different sizes have found evidence that HGT is the driving force behind genome expansion, which leads to larger genomes containing a higher fraction of transferred genes^1,2.

Previous studies have shown host-associated species to exchange more genes than those found in water or soil^11,13,14, leading us to next investigate the interspecies variability in gene transfer rates from this perspective. We mapped the species in our dataset to operational taxonomic units (OTUs) in the MicrobeAtlas database (Fig. 1), assigning ‘preferred’ habitats based on the highest average relative abundances. Restricting the analysis to transfers concerning gene pairs with ≥98% nucleotide identity, we indeed observed the highest median fraction of transferred genes in animal-associated species (1.32%). Plant-associated species had the second highest median fraction of transferred genes (0.46%), followed by soil-associated (0.16%) and finally water-associated (0.10%) species (Extended Data Fig. 2c). In contrast, when considering all transfer events in the dataset, we found no significant difference between animal-associated, water-associated and soil-associated species (Extended Data Fig. 2d). These findings indicate that on longer evolutionary scales, the loss of transferred genes may compensate for the higher rate of HGT in animal-associated species. Alternatively, animal-associated species may disappear at higher rates, possibly as a result of their host species going extinct.

Enrichment of accessory genes in recent transfers

We next focused on the distribution of our dataset with respect to species and gene distance (see Fig. 2a for all gene pairs in the dataset and Extended Data Fig. 1a for one representative gene pair per transfer event). The majority of gene pairs in the dataset originated from closely related species (species distance <0.3; Fig. 2a and Extended Data Fig. 1a, top histograms). However, gene pairs with transfer events were more likely to originate from distantly related species when compared with gene pairs without transfer events, especially after subsampling gene pairs with and without transfer events to follow the same gene distance distribution (P_{Mann–Whitney U} ≤ 2.2 × 10⁻¹⁶; Fig. 2b and Extended Data Fig. 1b). In addition, we observed generally lower gene distances in gene pairs with detected transfers, especially after subsampling gene pairs with and without transfer events to follow the same species distance distribution (P_{Mann–Whitney U} ≤ 2.2 × 10⁻¹⁶; Fig. 2c and Extended Data Fig. 1c). These results confirmed that transferred genes are more similar than expected based on species similarity.

The extent of HGT and resulting within-species gene content variation leads to a common distinction between core genes (present in all genomes of a species) and accessory genes (present in some genomes of a species)²⁴. Therefore, we studied the ubiquity of genes, namely, how often the transferred genes are found within the pangenome of a species, based on previously defined thresholds for extended core, intermediate-frequency accessory (shell) and low-frequency accessory (cloud) genes²⁵ (Fig. 2d). We observed that the odds of encountering a transferred gene within cloud genes were over twice as high as encountering a non-transferred gene within cloud genes (odds ratio = 2.07 in putative recipient species and 2.87 in putative donor species, P_Fisher ≤ 2.2 × 10⁻¹⁶ in both cases; Fig. 2d and Extended Data Fig. 3a). In contrast, the odds of encountering a transferred gene within the extended core genes were over twice as low as encountering a non-transferred gene within the extended core genes (odds ratio = 0.46 in putative recipient species and 0.37 in putative donor species, P_Fisher ≤ 2.2 × 10⁻¹⁶ in both cases). We next used gene distance as a proxy for time since the transfer event because genes transferred earlier in evolution have had more time to accumulate mutations and diverge from the donor. Interestingly, we observed higher fractions of extended core genes in older transfers (Extended Data Fig. 3b), implying persistence of a subset of transferred genes during species evolution. However, core gene sequences may produce more reliable trees than accessory genes (and, indeed, are used for building species trees²⁶), increasing the chances of detecting old transfers with high confidence. These results need to be interpreted with caution but they are congruent with the two-class model of gene evolution⁶, whereby genes with high turnover rates can be recruited to perform biological functions with long-term benefit. Such genes then switch to the second, slowly evolving and persistent, class.

Functional repertoires of recent and old transfer events

Multiple studies have considered the function of transferred genes^{9,13,15,18,27,28,29,30,31,32,33,34}, which, in the context of very recent transfers, has been shown to be predictive of HGT events¹⁵. To explore further, we divided our landscape of detected transfer events into bins based on species and gene distance and performed functional enrichment analysis for each bin using the Clusters of Orthologous Genes (COG) categories from eggNOG³⁵ (Fig. 2e and Methods), with gene distance again acting as a proxy for time since the transfer event. Recent transfers were enriched for genes participating in defence mechanisms, intracellular trafficking, cell cycle control, transcription, replication and repair, the mobilome, and genes of unknown function (Figs. 2f and 3 and Extended Data Fig. 4). In contrast, genes involved in various metabolic functions were depleted in recent transfer events and enriched in older transfer events (Figs. 2f and 3 and Extended Data Fig. 4). Finally, we found an overall depletion of transfers in genes involved in signal transduction, cell wall biogenesis and cell motility (Figs. 2f and 3 and Extended Data Fig. 4). To validate our findings with another system of functional annotation, we repeated the analysis using KEGG pathways³⁶ (Fig. 3 and Extended Data Fig. 5). As most pathways considered were associated with metabolic function, we observed a similar trend of significant depletion in recent transfers and enrichment in older transfers, although the latter was not always statistically significant (Fig. 3 and Extended Data Fig. 5).

**Fig. 3: Comparison of functional enrichment analysis results to those from previous studies.**

The use of various methods for defining transferred genes, different functional annotation systems and choice of background expectation complicate direct comparison between different studies. Moreover, in contrast to most previous studies, we performed functional enrichment analysis separately for gene and species pairs of varying degrees of divergence to prevent recent transfers between closely related species from dominating the enrichment results (Fig. 2e, bottom left). Nevertheless, we were able to select ten previous studies on HGT that performed functional enrichment analysis and compared their results with our observations from recent transfers (gene distance bins c, d, and e; Fig. 3, Supplementary Table 1 and Methods).

Notably, of the broad functional categories analysed for enrichment or depletion in HGT, 12 categories showed over 80% agreement in direction across studies. In cases where there was disagreement between studies (and/or with our study), some of the differences might reflect variations in how the functional categories were defined, or which gene families were particularly amenable to HGT detection. For example, there was low overall consensus in some categories related to information storage and processing, which has been previously discussed to be depleted in HGT^28,30,33. For such categories, it may be worth looking at the processes with a more fine-grained resolution. For example, in transcription, a case can be made for comparing genes involved in transcription regulation separately from other genes involved in transcription, as these appear to be more consistently enriched in transfers (Fig. 3). Furthermore, as genes can occasionally be transferred together with neighbouring genes on the chromosome, functional classification systems that pay increased attention to operon structures might be particularly suitable in interpreting large-scale HGT trends.

Antimicrobial resistance genes have been previously observed to be transferred at high rates^{11,13,14,15,16}. Therefore, we focused on genes annotated as such by KEGG. The most recent transfers displayed an over threefold enrichment in genes conferring resistance to β-lactams, aminoglycosides, tetracyclines, macrolides, phenicols and rifamycins (Extended Data Fig. 6). The degree of enrichment increased with species distance, suggesting that aggressive environmental selection for antimicrobial resistance can help overcome mechanistic barriers to HGT²² between distantly related species. Apart from the most recent transfers, however, we generally observed a depletion in transfers or no significant signal. The low degree of divergence between antimicrobial resistance genes shared via HGT could indicate transfer event recency but could also stem from strong evolutionary selection acting on these genes. Unfortunately, we are unable to distinguish whether these transfers occurred before or after widespread antibiotic usage, with previous estimates indicating nearly identical genes to have been transferred at any point in the last 1,000 (ref. ¹³) or 10,000 (ref. ¹⁶) years.

Co-occurring species are more likely to transfer genes

We then studied the species participating in HGT and possible associated ecological factors. By using the MicrobeAtlas database, an extensive collection of environmental samples mapped to the same 16S rRNA gene reference collection, we were able to observe the presence of two taxa within the same environmental sample and directly calculate co-occurrence rates. After mapping our dataset to OTUs in the MicrobeAtlas database, we observed a positive correlation between co-occurrence and the number of genes transferred for most OTUs (Fig. 4a, step 1 and precorrection histogram). However, genetic similarity has been shown to influence the success of HGT^{11,13,15,16,31}. Indeed, the number of transferred genes negatively correlated with the phylogenetic distance between the OTU pair (Fig. 4a, step 2). In addition, we observed a decrease in co-occurrence with increasing phylogenetic distance, in accordance with closely related taxa preferring similar environments³⁷ (Fig. 4a, step 2). We thus sought to correct for the phylogenetic signal in our observations on HGT and co-occurrence.

**Fig. 4: Co-occurring species are more likely to participate in HGT.**

To this end, we modelled the relationship between co-occurrence and phylogenetic distance using the power law equation (Fig. 4a, step 3). Upon comparing model residuals on co-occurrence with the number of genes transferred, the positive correlation remained for most OTUs (Fig. 4a, steps 4 and 5 and postcorrection histogram). As a complementary approach, we compared species pairs with multiple (seven or more) genes transferred with those with at most one gene transferred with respect to their phylogenetic distance and co-occurrence (Fig. 4b). After normalizing for differences in phylogenetic distance distribution, we observed that pairs of species with multiple transferred genes were significantly more likely to co-occur than pairs of species with at most one transferred gene (P_{Mann–Whitney U} ≤ 2.2 × 10⁻¹⁶; Fig. 4c). Correspondingly, when comparing species pairs with similar degrees of co-occurrence, we observed that pairs with multiple transferred genes were more likely to be closely related (P_{Mann–Whitney U} ≤ 2.2 × 10⁻¹⁶; Fig. 4d).

Observing a positive relationship between co-occurrence and the number of genes transferred, we next asked whether co-occurring species also need to interact to increase the chance of a successful HGT event. Predicting ecological interactions between two species based on co-occurrence can result in spurious associations arising from shared habitats, batch effects or interactions of both considered species with a third intermediary species. To correct for these effects, we used FlashWeave to generate a network of predicted ecological interactions³⁸. In brief, FlashWeave uses a Bayesian network-learning approach and interleaved conditional testing to heuristically adjust the associations for potential confounders (Methods). After generating the network, we compared the number of predicted ecological interactions between species pairs with multiple genes transferred and species pairs with at most one gene transferred, subsampling these two groups to follow the same phylogenetic distance and co-occurrence distributions. Within this data subset, we observed 1,012 interactions detected between species pairs with multiple genes transferred and 571 interactions between species pairs with at most one gene transferred, a 1.8-fold enrichment. In contrast, generating a network without conditional testing to remove putative spurious associations yielded 12,525 and 10,071 interactions respectively, a 1.2-fold enrichment. This suggests that there is a notable contribution of ecological interactions to HGT in an environment in addition to mere co-occurrence but further research that considers a larger number of interactions is needed.

Species with high abundance are more likely to transfer genes

We used the MicrobeAtlas database not only to look at presence or absence data but also to compare relative abundance profiles across over a million environmental samples. So far, only one study¹⁶ has looked into the relationship between HGT and species abundance, concluding that abundant bacteria are more likely to transfer genes to other abundant bacteria within the human gut. Unlike this previously mentioned study, we do not possess directly matched cultured isolate genomes with their relative abundance in the corresponding environmental sample, but we can determine whether a species is generally found in high or low abundances within a particular environment. Therefore, we assigned each OTU to its preferred habitat and compared HGT in OTUs lying on opposite ends of the environment’s OTU abundance distribution (Extended Data Fig. 7).

We observed a higher fraction of high-abundance OTU pairs participating in HGT when comparing pairs with similar phylogenetic distance (Fig. 5a). Interestingly, the increase in HGT probability with respect to abundance was higher in animal- and plant-associated microorganisms, and was significant in all pairwise comparisons of high–high, high–low and low–low abundance OTUs (Fig. 5a and Extended Data Table 1). In water- and soil-associated microorganisms, the increase in HGT probability was less apparent and not always significant (Fig. 5a and Extended Data Table 1). As HGT mechanisms often require physical proximity between cells exchanging DNA²², high-abundance species have more opportunities for transfer, assuming a well-mixed environment. The stronger signal observed in animal- and plant-associated organisms, however, indicates a role for host-associated factors in HGT.

**Fig. 5: Comparing relative abundance profiles across different environments shows different patterns of HGT for abundant versus rare species and generalists versus specialists.**

Finally, we defined an index of species generalism based on the relative abundance measurements of OTUs across different environments (Methods). Generalist species can thrive within a wide range of environments, whereas specialist species are confined to a particular environment. Our expectation was therefore that OTUs high on the generalism index can more easily disperse between environments, creating more opportunities for inter-environmental HGT. To this end, we selected 200 OTUs with the highest (generalists) and lowest (specialists) generalism index and compared the number of OTU pairs with at least one transfer event (Fig. 5b). Compared with all species, generalists showed a lower s.d. (Z-score, the number of standard deviations from the mean of the corresponding statistic calculated based on all species, = −10.5), lower range (Z-score = −6.56) and higher mean (Z-score = 2.67) of inter-environmental transfer rates. In contrast, specialists showed a higher s.d. (Z-score = 37) and higher range (Z-score = 31.4) of inter-environmental transfer rates. Interestingly, we observed a much higher rate of HGT between animal-associated specialists when compared with any other environmental and generalism index combination.

Conclusion

HGT is extensive and a fundamental driving force in prokaryotic genome evolution. In this study, we performed large-scale computational detection of HGT and integrated these data with an extensive microbial ecology dataset. In our dataset, an average of 42.5% genes in the genome were at one point affected by HGT. Most transferred genes were accessory and probably subjected to high turnover rates. Nevertheless, a fraction of genes transferred earlier in evolution managed to persist and become part of the extended core genome of the species. We have shown that such genes transferred earlier in evolution are enriched for metabolic functions. In contrast, genes transferred most recently are enriched for defence mechanisms and antimicrobial resistance. When considering previous knowledge on HGT and gene function, we show that 9 of 21 COG categories display no consistent signal across studies, suggesting additional factors at play.

Using the MicrobeAtlas database, we followed the global distribution of species that participated in HGT. Even after correction for the confounding effect of phylogenetic relatedness, species co-occurrence rates were positively correlated with larger numbers of transferred genes. In addition, we have shown that species interactions, abundance and dispersal affect HGT rates, indicating the importance of cell proximity for creating opportunities to transfer genes. These ecological factors could not have been assessed on such a global scale with previously available data, showing the value of the MicrobeAtlas database in describing high-level trends in microbial ecology and evolution.

Methods

Genome selection and pangenome generation

We based our analysis on the proGenomes v.2.2 dataset containing 82,400 genomes grouped into 11,562 species (that is, specI clusters) that were defined based on 40 single-copy marker genes²⁰. The corresponding species tree generated based on concatenated marker gene sequences was kindly provided by the authors of the proGenomes article²⁰.

From this initial selection, we filtered out metagenome-assembled genomes, single-amplified genomes, genomes flagged as chimeric by GUNC³⁹, genomes that were not taxonomically cohesive with the rest of the specI cluster according to GTDB²⁶, genomes with no 16S rRNA gene sequence detected and genomes we could not confidently map to the MicrobeAtlas database (see ‘Mapping genomes to MicrobeAtlas database OTUs’ below). The species tree was pruned to remove these genomes using the ETE Toolkit v.3 (ref. ⁴⁰). As a result, we obtained 78,315 genomes grouped into 8,790 species. For each species, a pangenome was generated by clustering all gene sequences on 95% nucleotide sequence identity as described in ref. ⁴¹.

HGT event detection

All gene sequences were clustered using MMseqs2 (ref. ⁴²) with minimum overlap of 50%, minimum identity threshold of 80% and clustering mode 0. The rest of the parameters were left as default. For each gene cluster, whenever sequences originated from more than one genome within a species, we only retained sequences that were most similar to those from other species within the gene cluster. We then proceeded with gene clusters containing sequences from at least five different species. Sequences were then aligned using the automatic strategy selection option in MAFFT v.7.471 (ref. ⁴³), with all other parameters left as default. On the basis of the multiple sequence alignment, a gene tree was generated using FastTree v.2.1.11 (ref. ⁴⁴) using the generalised time-reversible model⁴⁵ of nucleotide evolution, with all parameters left as default.

Before performing tree reconciliation, we subsampled the species tree using ETE Toolkit v.3 (ref. ⁴⁰) to decrease computational requirements in the following manner: for each gene cluster, the species tree node corresponding to the last common ancestor of all species within the gene cluster was selected. Clades within the species tree not containing any genes from the gene cluster were collapsed for computational efficiency. Subsequently, the subsampled species tree was used to root the gene tree using the OptRoot module from RANGER-DTL v.2.0 (ref. ²³). We then ran RANGER-DTL with default settings to perform gene and species tree reconciliation for a total of 500×. Gene clusters in which more than 50 optimal roots were detected were not considered further. Reconciliations from each optimal root were aggregated using the AggregateRanger_recipient module from RANGER-DTL v.2.0. We used a custom script to aggregate results across optimal roots and detect tree nodes that were labelled as transfers. For downstream analysis, we considered only transfer events detected in ≥80% reconciliations that contained gene pairs with ≥0.5 minimum branch support in the gene tree. In addition, all multifurcations containing 100% identical genes from different species were considered to be transfer events.

Calculating the average fraction of genes transferred

For each genome, we counted a gene as having undergone transfer as long as its pangenome-representative gene was involved in a transfer event. For the denominator (that is, total number of genes assessed), we only considered genes if their pangenome-representative genes had passed all steps described above in ‘HGT event detection’. The number of genes transferred was then divided by the total number of genes assessed and the average based on all genomes within a species was calculated. For the examples mentioned in the main text, we used data from specI_v3_Cluster259 for A. baumannii and data from specI_v3_Cluster712 for L. monocytogenes.

MicrobeAtlas data retrieval

The NCBI Sequence Read Archive⁴⁶ was searched for samples and studies containing any of the keywords ‘metagenomic’, ‘microb*’, ‘bacteria’ or ‘archaea’ in their metadata. The corresponding raw sequence data (as of 7 March 2020) were downloaded and quality filtered. To assign OTU labels, quality filtered data were mapped to MAPref v.2.2.1 using MAPseq v.1.0 at a ≥0.5 confidence level⁴⁷. We then filtered out samples containing less than 1,000 reads and/or less than 20 OTUs defined at 97% 16S rRNA gene identity and retained samples with at least 90% community coverage (calculated based on the formula in ref. ⁴⁸).

NCBI Sequence Read Archive sample metadata were parsed to classify every sample into four general environments: animal, aquatic, plant and soil. Subsequently, we calculated Bray–Curtis distances between all samples in the dataset and compared community compositions in samples from independent studies. When a sample was consistently similar to samples assigned to a different environment, we adjusted its environment label. In cases where samples with similar community compositions had no general agreement between assigned environments, we removed the environmental label.

Mapping genomes to MicrobeAtlas database OTUs

We used barrnap⁴⁹ with default settings to predict 16S rRNA gene sequences in the genome selection, proceeding with sequences of ≥50% of expected length. The sequences were then mapped to MAPref v.2.2.1 using MAPseq v.1.0 (ref. ⁴⁷), retaining only sequences that mapped to an OTU with a ≥0.3 confidence level. Genomes containing multiple 16S rRNA gene copies were mapped to OTUs based on a majority rule (≥50% copies) or high confidence (at least one copy with a 0.98 confidence level). Species containing multiple genomes were mapped to OTUs based on majority (≥50% genomes).

Preferred habitat assignment

For each OTU within the dataset, the average abundance was calculated separately for all samples assigned to the animal, aquatic, plant and soil environments. The OTU was then assigned to its preferred environment based on the highest of the four numbers.

Gene and species distance normalization

Distances between gene and species pairs were extracted from the corresponding trees using the dist function in ETE Toolkit v.3 (ref. ⁴⁰). To plot the distribution in Fig. 2a, only gene pairs with ≥0.5 minimum branch support values and ≥50% sequence overlap within the multiple sequence alignment were considered. Gene pairs with and without transfer events were normalized with respect to species distance by splitting the species distance distributions into 80 bins and subsampling the group with the larger number of pairs in each bin (either ‘transfer detected’ or ‘no transfer detected’) to the number of pairs in the second group in the corresponding bin (either ‘no transfer detected’ or ‘transfer detected’). The same procedure was performed for normalizing gene pairs with and without transfer events with respect to gene distance. After normalization, the resulting distributions were compared using the two-sided Mann–Whitney U-test.

Pangenome analysis

To calculate gene ubiquity, we counted the number of genomes represented by a gene in each pangenome versus the total number of genomes in the species. For subsequent analysis, only species encompassing ten or more genomes were considered. We used previously defined thresholds²⁵ to distinguish extended core genes (≥90% gene ubiquity) and cloud genes (≤15% gene ubiquity). In the species pair participating in HGT, the species with the higher gene ubiquity was labelled as the putative donor, whereas the species with the lower gene ubiquity was labelled as the putative recipient. To compare extended core and cloud genes with or without transfer events, a two-sided Fisher’s exact test was performed.

Genome annotation and functional enrichment analysis

We used the COG category and KEGG pathway functional annotations provided by the proGenomes database after running eggNOG-mapper for eggNOG 5.0 (ref. ³⁵). Each gene cluster was annotated to the corresponding functional categories based on the union of all gene annotations within the cluster. To analyse genes associated with the mobilome, we looked up which terms corresponded to the ‘X—Mobilome: phages, transposons’ category in the database of COGs^50,51 (mobilome, curated, in Extended Data Fig. 4). In addition, we extracted terms that contained the following keywords in the annotations provided by the proGenomes database: ‘phage’, ‘transposon’, ‘transposase’, ‘transposition’, ‘transposable’, ‘mobile’, ‘mobilization’, ‘integrase’, ‘integration’, ‘plasmid’, ‘conjugative’, ‘conjugation’, ‘transformation’ and ‘competence’ (mobilome, uncurated, in Extended Data Fig. 4). To analyse genes associated with transcription regulation, we extracted terms from the transcription category that contained the following keywords in the annotations provided by the proGenomes database: ‘regulation’ and ‘regulator’ (transcription regulation, uncurated, in Extended Data Fig. 4). We calculated a functional category’s background expectation fraction by counting the total number of genes that passed the pipeline that were annotated to this category divided by the total number of genes that passed the pipeline.

For each detected transfer event, we calculated the average species and gene distance by taking all average pairwise distances between left descendants and right descendants of the transfer event (for gene distance calculations, only gene pairs with ≥50% sequence overlap were considered). The resulting distribution of species and gene distances can be seen in Fig. 2e. For functional enrichment analysis, minimum and maximum species and gene distance cut-offs were selected in such a way that there were no bins without observations, with the resulting area divided into thirds. We also looked specifically at transfer events at the 0.01 and 0.05 gene distance cut-offs (approximately ≥99% and ≥95% sequence identity, respectively) as these results would be more comparable to previous studies that detected HGT events based on nearly identical sequences. We then counted the number of transfer events annotated to each functional category divided by the total number of transfer events in the area. The observed fraction of events annotated to a specific function was then tested with a two-sided binomial test against the fraction of all genes on which the pipeline was run that were annotated to this function. Resulting P values were corrected for multiple testing using the Holm–Sidak method.

A similar procedure was performed using KEGG ortholog annotations, grouping them into KEGG pathway maps (09101–09145) for Extended Data Fig. 5 and antimicrobial resistance genes (BR:ko01504) for Extended Data Fig. 6.

Functional repertoire comparison with previous studies

We compared our functional enrichment analysis results with those from refs. ^{9,13,18,28,29,30,31,32,33,34}. In most of these studies, functional categories were based on the COG database, with the exception of ref. ¹³ (with categories based on the SEED⁵²) and refs. ^9,28 (both with categories based on TIGRFAMs⁵³). The mapping between COG categories and KEGG pathways (used in our study), SEED and TIGRFAMs can be found in Supplementary Table 1.

For our study, we considered enrichment data from the most recent transfers, that is, gene distance bins 0.00–0.01, 0.00–0.05 and 0.00–0.25. These three gene distance bins together with three species distance bins provided us with nine data points to consider for each functional category. We assigned a functional category to have strong evidence for enrichment or depletion in transfers if at least seven of the nine data points showed significant enrichment or depletion. We assigned a functional category to have weak evidence for enrichment or depletion in transfers if most data points showed enrichment or depletion but this was not always statistically significant.

For ref. ¹⁸, we considered the results depicted in Fig. 8d and Supplementary Table 13 of the article. We calculated the first and third quartiles of the HGT index using all genes in Supplementary Table 13. We assigned a functional category to have strong evidence for enrichment in transfers if the median HGT index from genes in this category was greater than the third quartile. We assigned a functional category to have strong evidence for depletion in transfers if the median HGT index from genes in this category was less than the first quartile. Only functional categories containing at least five genes were considered.

For ref. ³⁴, we considered the results depicted in Fig. 9 of the article. We considered only recent HGT events (≥99% nucleotide sequence identity). We assigned a functional category to have strong evidence for enrichment in transfers if the median recent HGTs in this category was greater than the third quartile. We assigned a functional category to have strong evidence for depletion in transfers if the median recent HGTs in this category was less than the first quartile.

For ref. ³², we considered the results depicted in Fig. 4a (HTgenes row) of the article. We considered a functional category to have strong evidence for enrichment or depletion in transfers if the observed-to-expected ratio of orthologous groups was significantly different from one.

For ref. ³¹, we considered the results depicted in Supplementary Fig. 7 of the article. We considered a functional category to have strong evidence for enrichment or depletion in transfers if the relative proportion of transferred genes was significantly over- or underrepresented when compared with the set of all bacterial genes.

For ref. ³⁰, we considered the results depicted in the first two columns of Table 3 of the article. We considered a functional category to be enriched in transfers if its relative transferability was higher than one, and to be depleted in transfers if its relative transferability was lower than one. We used a P value cut-off of 0.05 to distinguish strong and weak evidence for enrichment or depletion.

For ref. ³³, we considered the results depicted in Table 2 of the article. In the table, functional categories were listed that significantly differed from the background of all gene families. We used a P value cut-off of 0.05 to distinguish strong and weak evidence for enrichment or depletion.

For ref. ²⁹, we considered the results depicted in Fig. 4b of the article. We used Z-score cut-offs of 2 and −2 to distinguish strong and weak evidence for enrichment or depletion.

For ref. ¹³, we considered the results depicted in Supplementary File 6 (SEED level 1 and SEED level 2) of the article. We used a P value cut-off of 0.05 to distinguish strong and weak evidence for enrichment or depletion. We downweighted depletion evidence for the ‘transcription (regulatory)’ and ‘signal transduction’ categories as they both mapped to ‘regulation and cell signalling’ in the SEED. For COG categories that mapped to multiple categories in the SEED, we indicated evidence based on the consensus from these categories.

For ref. ²⁸, we considered the results depicted in Table 2 of the article. We downweighted depletion evidence for ‘cell cycle control and mitosis’ and ‘cell motility’ as they both mapped to the ‘cellular processes’ in TIGRFAMs. We also downweighted enrichment evidence for ‘carbohydrate transport and metabolism’ as there was no one-to-one mapping for this category.

For ref. ⁹, we considered the results depicted in Fig. 2 of the article. We considered a functional category to be enriched in transfers if the proportion of transferred genes was greater than 10%, and to be depleted in transfers if the proportion of transferred genes was less than 3%.

Co-occurrence analysis

An OTU was detected as present in a given sample if its relative abundance was at least 0.01%. To calculate the co-occurrence between two OTUs, we counted the number of samples in which both OTUs were present and divided it by the number of samples in which the less prevalent OTU was present. Phylogenetic distances between OTUs were retrieved from the MicrobeAtlas database 16S rRNA tree using the dist function in ETE Toolkit v.3 (ref. ⁴⁰).

For modelling the relationship between co-occurrence and phylogenetic distance, we only considered OTUs that exchanged at least 1 gene with 30 other OTUs and OTU pairs in which both OTUs were present in at least 20 environmental samples. The power law equation (1) is as follows:

$${{\rm{CO}} \approx k\times {\rm{PD}}^{a},}$$

(1)

where CO stands for co-occurrence, PD stands for phylogenetic distance, and k and a are parameters fitted using the nlstools package in R⁵⁴. Model residuals were then used to calculate Spearman correlations with the number of genes transferred. To generate the background distribution, the number of genes was shuffled before calculating Spearman correlations. The resulting distributions of Spearman correlations generated based on raw co-occurrence (precorrection), model residuals (postcorrection) or background were compared with each other using the two-sided Mann–Whitney U-test.

The analysis depicted in Fig. 4b–d has been performed using a similar set-up as described in ‘Gene and species distance normalization’. We used the ≥7 genes transferred cut-off to denote OTU pairs with many transfer events as this corresponded to the 80% quantile of OTU pairs with at least 1 gene transferred.

Interaction prediction and analysis

Global networks of predicted interactions were computed with FlashWeave v.0.19.0 (ref. ³⁸). This method uses the local-to-global learning approach⁵⁵ to learn the skeleton of a Bayesian network encoding putative ecological relationships between species adjusted for ecological or technical confounders. To this end, FlashWeave uses an interleaved testing scheme that (1) heuristically determines likely confounding variables for each pair of species (based on univariate associations and previous iterations of the algorithm), and (2) subsequently tests whether the focal association holds when conditioned on these candidate confounders.

The parameters used for running FlashWeave were as follows: sensitive = false, heterogeneous = true, and max_k = 3 (with confounder correction) or max_k = 0 (without confounder correction). With these settings, FlashWeave converts non-zero read counts to centred log-ratio-transformed values to account for compositionality and discretizes these values. Mutual information tests are then run on the discretized values. We used co-occurrence data from all 95,422 OTUs contained within the environmental sample dataset, filtering the resulting network for edges between the 4,380 OTUs for which transfer event data were generated. OTU pairs with a score higher than zero were considered as interacting. To normalize for differences in phylogenetic distance and co-occurrence distributions between species with at least seven genes transferred and species with zero or one gene transfer, the procedure described in ‘Gene and species distance normalization’ was performed with simultaneous subsampling on phylogenetic distance and co-occurrence for 80 × 80 bins.

Abundance analysis

We used the same relative abundance numbers as calculated in ‘Preferred habitat assignment’. For each OTU, we only considered its abundance within its preferred environment, denoting high-abundance OTUs as those whose abundance was above the 80% quantile in this environment. In contrast, we denoted low-abundance OTUs as those whose abundance was below the 20% quantile in this environment. OTU pairs were then sorted based on phylogenetic distance and the fraction of OTU pairs with at least one transfer event detected was calculated for each phylogenetic distance bin. Error bands were calculated using Bernoulli’s principle of uncertainty. Resulting fractions were then pairwise compared between the high–high, high–low and low–low groups using a one-sided Wilcoxon rank-sum test. Resulting P values were corrected for multiple testing using the Benjamini–Hochberg method.

Generalist and specialist analysis

We computed a generalism index for each OTU reflecting its environmental flexibility. This index was calculated based on the entropy of the OTU’s abundance values across the four major environments (animal, aquatic, soil and plant). OTUs with similar abundances across environments had higher entropy. OTUs with uneven abundances across environments (a higher abundance in one or a few of the environments compared with the rest) had lower entropy.

To compare inter-environmental transfers, we selected 200 OTUs assigned to each environment (see ‘Preferred habitat assignment’) that displayed the highest entropy (generalists) and 200 OTUs that displayed the lowest entropy (specialists). OTU pairs were then subsampled in such a way that phylogenetic distance distributions were equal between all environments and between generalists, specialists and all species. We then counted the fraction of OTU pairs with at least one transfer event detected. To generate the background expectation, OTU pairs from all species were subsampled to the target phylogenetic distance distribution 1,000×. We then fit a normal distribution to the generated data using the fitdistr function in R⁵⁶ to get an estimate of the expected mean, s.d. and range of transfer rates between different environments.

Data visualization

Data from Figs. 2 and 4b,c and Extended Data Figs. 1–6 were visualized using seaborn v.0.11.2 (ref. ⁵⁷) and matplotlib v.3.5.1 (ref. ⁵⁸) in Python v.3.7.4. Data from Figs. 3, 4a, and 5 and Extended Data Fig. 7 were visualized using ggplot2 v.3.3.5 (ref. ⁵⁹) in R v.4.1.1.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

The original data from proGenomes v.2 can be downloaded per genome or per specI cluster at http://progenomes2.embl.de. The MicrobeAtlas database is developed within the C.v.M. group and can be downloaded from https://microbeatlas.org/. For the study, we used a subset of an older version of MicrobeAtlas. This subset, along with all datasets generated and used during the study, can be downloaded from https://doi.org/10.6084/m9.figshare.22893632. Source data are provided with this paper.

Code availability

The scripts used for dataset generation and analysis can be accessed at https://github.com/marydmit/eco_evolutionary_factors_and_hgt.

References

Puigbò, P., Lobkovsky, A. E., Kristensen, D. M., Wolf, Y. I. & Koonin, E. V. Genomes in turmoil: quantification of genome dynamics in prokaryote supergenomes. BMC Biol. 12, 66 (2014).
Article PubMed PubMed Central Google Scholar
Treangen, T. J. & Rocha, E. P. C. Horizontal transfer, not duplication, drives the expansion of protein families in prokaryotes. PLoS Genet. 7, e1001284 (2011).
Article CAS PubMed PubMed Central Google Scholar
Dagan, T. & Martin, W. Ancestral genome sizes specify the minimum rate of lateral gene transfer during prokaryote evolution. Proc. Natl Acad. Sci. USA 104, 870–875 (2007).
Article CAS PubMed PubMed Central Google Scholar
Dagan, T., Artzy-Randrup, Y. & Martin, W. Modular networks and cumulative impact of lateral transfer in prokaryote genome evolution. Proc. Natl Acad. Sci. USA 105, 10039–10044 (2008).
Article CAS PubMed PubMed Central Google Scholar
Sorek, R. et al. Genome-wide experimental determination of barriers to horizontal gene transfer. Science 318, 1449–1452 (2007).
Article CAS PubMed Google Scholar
Wolf, Y. I., Makarova, K. S., Lobkovsky, A. E. & Koonin, E. V. Two fundamentally different classes of microbial genes. Nat. Microbiol. 2, 16208 (2016).
Article CAS PubMed Google Scholar
Sela, I., Wolf, Y. I. & Koonin, E. V. Theory of prokaryotic genome evolution. Proc. Natl Acad. Sci. USA 113, 11399–11407 (2016).
Article CAS PubMed PubMed Central Google Scholar
Ravenhall, M., Škunca, N., Lassalle, F. & Dessimoz, C. Inferring horizontal gene transfer. PLoS Comput. Biol. 11, e1004095 (2015).
Article PubMed PubMed Central Google Scholar
Nakamura, Y., Itoh, T., Matsuda, H. & Gojobori, T. Biased biological functions of horizontally transferred genes in prokaryotic genomes. Nat. Genet. 36, 760–766 (2004).
Article CAS PubMed Google Scholar
Lawrence, J. G. & Ochman, H. Amelioration of bacterial genomes: rates of change and exchange. J. Mol. Evol. 44, 383–397 (1997).
Article CAS PubMed Google Scholar
Smillie, C. S. et al. Ecology drives a global network of gene exchange connecting the human microbiome. Nature 480, 241–244 (2011).
Article CAS PubMed Google Scholar
Brito, I. L. et al. Mobile genes in the human microbiome are structured from global to individual scales. Nature 535, 435–439 (2016).
Article CAS PubMed PubMed Central Google Scholar
Sheinman, M. et al. Identical sequences found in distant genomes reveal frequent horizontal transfer across the bacterial domain. eLife 10, e62719 (2021).
Article CAS PubMed PubMed Central Google Scholar
Fondi, M. et al. ‘Every gene is everywhere but the environment selects’: global geolocalization of gene sharing in environmental samples through network analysis. Genome Biol. Evol. 8, 1388–1400 (2016).
Article PubMed PubMed Central Google Scholar
Zhou, H., Beltrán, J. F. & Brito, I. L. Functions predict horizontal gene transfer and the emergence of antibiotic resistance. Sci. Adv. 7, eabj5056 (2021).
Article CAS PubMed PubMed Central Google Scholar
Groussin, M. et al. Elevated rates of horizontal gene transfer in the industrialized human microbiome. Cell 184, 2053–2067.e18 (2021).
Article CAS PubMed Google Scholar
Kloesges, T., Popa, O., Martin, W. & Dagan, T. Networks of gene sharing among 329 proteobacterial genomes reveal differences in lateral gene transfer frequency at different phylogenetic depths. Mol. Biol. Evol. 28, 1057–1074 (2011).
Article CAS PubMed Google Scholar
Jeong, H., Arif, B., Caetano-Anollés, G., Kim, K. M. & Nasir, A. Horizontal gene transfer in human-associated microorganisms inferred by phylogenetic reconstruction and reconciliation. Sci. Rep. 9, 5953 (2019).
Article PubMed PubMed Central Google Scholar
Choi, Y. et al. HGTree v2.0: a comprehensive database update for horizontal gene transfer (HGT) events detected by the tree-reconciliation method. Nucleic Acids Res. 51, D1010–D1018 (2023).
Article CAS PubMed Google Scholar
Mende, D. R. et al. proGenomes2: an improved database for accurate and consistent habitat, taxonomic and functional annotations of prokaryotic genomes. Nucleic Acids Res. https://doi.org/10.1093/nar/gkz1002 (2019).
Khedkar, S. et al. Landscape of mobile genetic elements and their antibiotic resistance cargo in prokaryotic genomes. Nucleic Acids Res. 50, 3155–3168 (2022).
Article CAS PubMed PubMed Central Google Scholar
Thomas, C. M. & Nielsen, K. M. Mechanisms of, and barriers to, horizontal gene transfer between bacteria. Nat. Rev. Microbiol. 3, 711–721 (2005).
Article CAS PubMed Google Scholar
Bansal, M. S., Kellis, M., Kordi, M. & Kundu, S. RANGER-DTL 2.0: rigorous reconstruction of gene-family evolution by duplication, transfer and loss. Bioinformatics 34, 3214–3216 (2018).
Article CAS PubMed PubMed Central Google Scholar
McInerney, J. O., McNally, A. & O’Connell, M. J. Why prokaryotes have pangenomes. Nat. Microbiol 2, 17040 (2017).
Article CAS PubMed Google Scholar
Maistrenko, O. M. et al. Disentangling the impact of environmental and phylogenetic constraints on prokaryotic within-species diversity. ISME J. 14, 1247–1259 (2020).
Article PubMed PubMed Central Google Scholar
Parks, D. H. et al. A standardized bacterial taxonomy based on genome phylogeny substantially revises the tree of life. Nat. Biotechnol. 36, 996–1004 (2018).
Article CAS PubMed Google Scholar
N’Guessan, A., Brito, I. L., Serohijos, A. W. R. & Shapiro, B. J. Mobile gene sequence evolution within individual human gut microbiomes is better explained by gene-specific than host-specific selective pressures. Genome Biol. Evol. 13, evab142 (2021).
Article PubMed PubMed Central Google Scholar
Paquola, A. C. M. et al. Horizontal gene transfer building prokaryote genomes: genes related to exchange between cell and environment are frequently transferred. J. Mol. Evol. 86, 190–203 (2018).
Article CAS PubMed Google Scholar
Cordero, O. X. & Hogeweg, P. The impact of long-distance horizontal gene transfer on prokaryotic genome size. Proc. Natl Acad. Sci. USA 106, 21748–21753 (2009).
Article CAS PubMed PubMed Central Google Scholar
Cohen, O., Gophna, U. & Pupko, T. The complexity hypothesis revisited: connectivity rather than function constitutes a barrier to horizontal gene transfer. Mol. Biol. Evol. 28, 1481–1489 (2011).
Article CAS PubMed Google Scholar
Popa, O., Landan, G. & Dagan, T. Phylogenomic networks reveal limited phylogenetic range of lateral gene transfer by transduction. ISME J. 11, 543–554 (2017).
Article CAS PubMed Google Scholar
Oliveira, P. H., Touchon, M., Cury, J. & Rocha, E. P. C. The chromosomal organization of horizontal gene transfer in bacteria. Nat. Commun. 8, 841 (2017).
Article PubMed PubMed Central Google Scholar
Cohen, O. & Pupko, T. Inference and characterization of horizontally transferred gene families using stochastic mapping. Mol. Biol. Evol. 27, 703–713 (2010).
Article CAS PubMed Google Scholar
Song, W., Wemheuer, B., Zhang, S., Steensen, K. & Thomas, T. MetaCHIP: community-level horizontal gene transfer identification through the combination of best-match and phylogenetic approaches. Microbiome 7, 36 (2019).
Article PubMed PubMed Central Google Scholar
Huerta-Cepas, J. et al. eggNOG 5.0: a hierarchical, functionally and phylogenetically annotated orthology resource based on 5090 organisms and 2502 viruses. Nucleic Acids Res. 47, D309–D314 (2019).
Article CAS PubMed Google Scholar
Kanehisa, M., Furumichi, M., Tanabe, M., Sato, Y. & Morishima, K. KEGG: new perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Res. 45, D353–D361 (2017).
Article CAS PubMed Google Scholar
Tamames, J., Sánchez, P. D., Nikel, P. I. & Pedrós-Alió, C. Quantifying the relative importance of phylogeny and environmental preferences as drivers of gene content in prokaryotic microorganisms. Front. Microbiol. 7, 433 (2016).
Article PubMed PubMed Central Google Scholar
Tackmann, J., Matias Rodrigues, J. F. & von Mering, C. Rapid inference of direct interactions in large-scale ecological networks from heterogeneous microbial sequencing data. Cell Syst. 9, 286–296.e8 (2019).
Article CAS PubMed Google Scholar
Orakov, A. et al. GUNC: detection of chimerism and contamination in prokaryotic genomes. Genome Biol. 22, 178 (2021).
Article CAS PubMed PubMed Central Google Scholar
Huerta-Cepas, J., Serra, F. & Bork, P. ETE 3: reconstruction, analysis, and visualization of phylogenomic data. Mol. Biol. Evol. 33, 1635–1638 (2016).
Article CAS PubMed PubMed Central Google Scholar
Coelho, L. P. et al. Towards the biogeography of prokaryotic genes. Nature 601, 252–256 (2022).
Article CAS PubMed Google Scholar
Steinegger, M. & Söding, J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat. Biotechnol. 35, 1026–1028 (2017).
Article CAS PubMed Google Scholar
Katoh, K. & Standley, D. M. MAFFT Multiple Sequence Alignment Software version 7: improvements in performance and usability. Mol. Biol. Evol. 30, 772–780 (2013).
Article CAS PubMed PubMed Central Google Scholar
Price, M. N., Dehal, P. S. & Arkin, A. P. FastTree 2—approximately maximum-likelihood trees for large alignments. PLoS ONE 5, e9490 (2010).
Article PubMed PubMed Central Google Scholar
Tavaré, S. Some probabilistic and statistical problems on the analysis of DNA sequence. Lect. Math. Life Sci. 17, 57–86 (1986).
Sayers, E. W. et al. Database resources of the National Center for Biotechnology Information in 2023. Nucleic Acids Res. 51, D29–D38 (2023).
Article CAS PubMed Google Scholar
Matias Rodrigues, J. F., Schmidt, T. S. B., Tackmann, J. & von Mering, C. MAPseq: highly efficient k-mer search with confidence estimates, for rRNA sequence analysis. Bioinformatics 33, 3808–3810 (2017).
Article PubMed PubMed Central Google Scholar
Chao, A. & Jost, L. Coverage-based rarefaction and extrapolation: standardizing samples by completeness rather than size. Ecology 93, 2533–2547 (2012).
Article PubMed Google Scholar
Seemann, T. barrnap 0.9: rapid ribosomal RNA prediction (2018).
Tatusov, R. L., Koonin, E. V. & Lipman, D. J. A genomic perspective on protein families. Science 278, 631–637 (1997).
Article CAS PubMed Google Scholar
Galperin, M. Y., Makarova, K. S., Wolf, Y. I. & Koonin, E. V. Expanded microbial genome coverage and improved protein family annotation in the COG database. Nucleic Acids Res. 43, D261–D269 (2015).
Article CAS PubMed Google Scholar
Overbeek, R. et al. The SEED and the rapid annotation of microbial genomes using subsystems technology (RAST). Nucleic Acids Res. 42, D206–D214 (2014).
Article CAS PubMed Google Scholar
Haft, D. H. et al. TIGRFAMs and genome properties in 2013. Nucleic Acids Res. 41, D387–D395 (2013).
Article CAS PubMed Google Scholar
Baty, F. et al. A toolbox for nonlinear regression in R: the package nlstools. J. Stat. Softw. 66, 1–21 (2015).
Article Google Scholar
Aliferis, C. F., Statnikov, A., Tsamardinos, I., Mani, S. & Koutsoukos, X. D. Local causal and Markov blanket induction for causal discovery and feature selection for classification part I: algorithms and empirical evaluation. J. Mach. Learn. Res. 11, 171–234 (2010).
Google Scholar
Delignette-Muller, M. L. & Dutang, C. fitdistrplus: an R package for fitting distributions. J. Stat. Softw. 64, 1–34 (2015).
Article Google Scholar
Waskom, M. L. seaborn: statistical data visualization. J. Open Source Softw. 6, 3021 (2021).
Article Google Scholar
Hunter, J. D. Matplotlib: a 2D graphics environment. Comput. Sci. Eng. 9, 90–95 (2007).
Article Google Scholar
Wickham, H. ggplot2: Elegant Graphics for Data Analysis (Springer, 2016).

Download references

Acknowledgements

We thank D. R. Mende (Amsterdam UMC), T. S. B. Schmidt (EMBL), A. Orakov (EMBL) and O. M. Maistrenko (NIOZ) for providing data from proGenomes v.2 and for answering our questions about the dataset. We thank M. Robinson (UZH), S. Tiberi (University of Bologna) and P.-L. Germain (UZH and ETH Zurich) for statistical advice. Finally, we thank A. Cuscó (Fudan University) for providing comments on an earlier version of the article. M.D. and C.v.M. were supported by funding from the Swiss National Science Foundation through NCCR Microbiomes (51NF40_180575) and a project grant (number 310030-192569). J.T. and J.F.M.R. were partially supported through UZH core funding (to C.v.M.). L.P.C. was partially supported by the Shanghai Municipal Science and Technology Major Project (number 2018SHZDZX01) and the 111 Project (number B18015). J.H.-C. was supported by funding from MCIN/AEI/10.13039/501100011033/ and FEDER Una manera de hacer Europa (grant number PID2021-127210NB-I00).

Funding

Open access funding provided by University of Zurich.

Author information

These authors contributed equally: Luis Pedro Coelho, Christian von Mering.

Authors and Affiliations

Department of Molecular Life Sciences and Swiss Institute of Bioinformatics, University of Zürich, Zurich, Switzerland
Marija Dmitrijeva, Janko Tackmann, João Frederico Matias Rodrigues & Christian von Mering
Department of Biology, Institute of Microbiology and Swiss Institute of Bioinformatics, ETH Zürich, Zurich, Switzerland
Marija Dmitrijeva
Centro de Biotecnología y Genómica de Plantas, Universidad Politécnica de Madrid (UPM)–Instituto Nacional de Investigación y Tecnología Agraria y Alimentaria (INIA-CSIC), Campus de Montegancedo-UPM, Madrid, Spain
Jaime Huerta-Cepas
Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai, China
Luis Pedro Coelho
Centre for Microbiome Research, School of Biomedical Sciences, Queensland University of Technology, Translational Research Institute, Woolloongabba, Queensland, Australia
Luis Pedro Coelho

Authors

Marija Dmitrijeva
View author publications
You can also search for this author in PubMed Google Scholar
Janko Tackmann
View author publications
You can also search for this author in PubMed Google Scholar
João Frederico Matias Rodrigues
View author publications
You can also search for this author in PubMed Google Scholar
Jaime Huerta-Cepas
View author publications
You can also search for this author in PubMed Google Scholar
Luis Pedro Coelho
View author publications
You can also search for this author in PubMed Google Scholar
Christian von Mering
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

M.D., J.T., J.F.M.R., L.P.C. and C.v.M. conceived and designed the study. M.D., J.T., J.F.M.R., J.H.-C. and L.P.C. generated the data. M.D. performed the statistical analyses and generated the visualizations. L.P.C. and C.v.M. supervised the study. M.D. wrote the first draft of the article, with input from L.P.C. and C.v.M. All authors contributed to revising and editing the final article.

Corresponding authors

Correspondence to Luis Pedro Coelho or Christian von Mering.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Ecology & Evolution thanks Gregory Fournier, Devani Romero Picazo and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Genes participating in HGT display lower gene distances and originate from more distant species when compared to non-transferred genes.

Each transfer event is represented by one randomly-chosen gene pair. (a) Two-dimensional histogram depicting distributions of the distance between genes in a pair (y-axis, right marginal histogram) against the distance between their corresponding species (x-axis, top marginal histogram), comparing gene pairs with (green, n = 2,385,585) and without transfers (brown, n = 3,042,429). Bins containing fewer than ten observations from each group are coloured in grey. (b) After normalizing for differences in gene distance distributions, the species distance distribution of gene pairs with transfers (green) is significantly different to that of gene pairs without transfers (brown) (two-sided Mann-Whitney U test, P ≤ 2.2×10⁻¹⁶, n = 2,175,683). (c) After normalizing for differences in species distance distributions, the gene distance distribution of gene pairs with transfers (green) is significantly different to that of gene pairs without transfers (brown) (two-sided Mann-Whitney U test, P ≤2.2×10⁻¹⁶, n = 2,040,229).

Extended Data Fig. 2 Analysis of factors contributing to interspecies variability in rates of HGT.

(a) The fraction of genes with a transfer event detected does not significantly correlate with the number of genomes used for generating the pangenome (two-sided Pearson’s product correlation coefficient 0.01, P = 0.17, n = 8,776 species). The solid purple line represents the regression line (least squares linear regression in the original space, which results in a non-linear relationship in the displayed log-linear plot). The shaded area represents a 95% confidence interval, estimated based on 1,000 bootstraps. (b) The fraction of genes with a transfer event detected correlates with the average number of genes in the genome of the corresponding species (two-sided Pearson’s product correlation coefficient 0.18, P = 7×10^–64, n = 8,776 species). The solid purple line represents the least squares linear regression line. The shaded area represents a 95% confidence interval, estimated based on 1,000 bootstraps. (c) The fraction of genes with a recent transfer event (that is involving gene pairs with ≥98% nucleotide identity) detected compared across species’ preferred environments, with *** indicating a significant difference in the fraction of genes with a recent transfer event (two-sided Mann-Whitney U test P ≤ 0.001, n = 2,080 animal, 3,047 aquatic, 2,415 plant, 1,153 soil species). (d) The fraction of genes with a transfer event detected compared across species’ preferred environments, with *** indicating a significant difference in the fraction of genes with a transfer event (two-sided Mann-Whitney U test P ≤ 0.001, n = 2,080 animal, 3,047 aquatic, 2,415 plant, 1,153 soil species). In all depicted boxplots, the line inside the box corresponds to the median. The lower and upper bounds of the box denote the first and the third quartile respectively. Whiskers extend to at most 1.5 times the interquartile range (IQR).

Source data

Extended Data Fig. 3 Analysis of gene ubiquity and its relationship to HGT.

(a) Distribution of gene ubiquity (expressed in the fraction of genomes in species with gene) in putative donor species for gene pairs with (green, n = 335,841) and without (brown, n = 40,450) transfers. (b) Comparing gene ubiquity of gene pairs participating in HGT across different gene distance bins. More recent transfer events are depicted in dark green, while old transfers are depicted in light green. The number of gene pairs per bin: 196,527 (0.0–0.2), 76,886 (0.2–0.4), 47,104 (0.4–0.6), 12,324 (0.6–0.8).

Extended Data Fig. 4 Functional enrichment within bins depicted in Fig. 2e for remaining COG categories not in Fig. 2f.

Species distance bins are 1 (0.08–0.72), 2 (0.72–1.36), and 3 (1.36–2.00). Gene distance bins are a (0.50–0.75), b (0.25–0.50), c (0.00–0.25), d (0.00–0.05), and e (0.00–0.01). The significance of enrichment (blue) or depletion (red) in the number of transfers is indicated within boxes (two-sided binomial test after multiple testing correction using the Holm-Sidak method: *** - P ≤ 0.001, ** - P ≤ 0.01, * - P ≤ 0.05, ns - not significant).

Source data

Extended Data Fig. 5 Functional enrichment within bins depicted in Fig. 2e for KEGG pathway maps.

Species distance bins are 1 (0.08–0.72), 2 (0.72–1.36), and 3 (1.36–2.00). Gene distance bins are a (0.50–0.75), b (0.25–0.50), c (0.00–0.25), d (0.00–0.05), and e (0.00–0.01). The significance of enrichment (blue) or depletion (red) in the number of transfers is indicated within boxes (two-sided binomial test after multiple testing correction using the Holm-Sidak method: *** - P ≤ 0.001, ** - P ≤ 0.01, * - P ≤ 0.05, ns - not significant).

Source data

Extended Data Fig. 6 Functional enrichment within bins depicted in Fig. 2e for the BRITE hierarchy ‘Antimicrobial resistance genes’.

Species distance bins are 1 (0.08–0.72), 2 (0.72–1.36), and 3 (1.36–2.00). Gene distance bins are a (0.50–0.75), b (0.25–0.50), c (0.00–0.25), d (0.00–0.05), and e (0.00–0.01). The significance of enrichment (blue) or depletion (red) in the number of transfers is indicated within boxes (two-sided binomial test after multiple testing correction using the Holm-Sidak method: *** - P ≤ 0.001, ** - P ≤ 0.01, * - P ≤ 0.05, ns - not significant).

Source data

Extended Data Fig. 7 Distribution of OTU abundances in four main environments in the MicrobeAtlas database.

Environments considered are: animal (red, n = 1,243 OTUs), aquatic (blue, n = 1,774 OTUs), plant (green, n = 775 OTUs) and soil (orange, n = 588 OTUs). Vertical lines depict the 20% and 80% quantiles of the abundance distribution, with low-abundance OTUs lying to the left and high-abundance OTUs lying to the right.

Source data

Extended Data Table 1 FDR-corrected p-values from one-sided Wilcoxon rank-sum tests on the curves depicted in Fig. 5a

Full size table

Supplementary information

Supplementary Information

Supplementary Table 1.

Reporting Summary

Source data

Source Data Fig. 2

Statistical source data for Fig. 2f. For Fig. 2a–e, the source dataset is too large to include within the article; please refer to ‘Data availability’ statement.

Source Data Fig. 4

Statistical source data for Fig. 4a. For Fig. 4b–d, the source dataset is too large to include within the article; please refer to ‘Data availability’ statement.

Source Data Fig. 5

Statistical source data. The complete source dataset is too large to include within the article; please refer to ‘Data availability’ statement.

Source Data Extended Data Fig. 2

Statistical source data.

Source Data Extended Data Fig. 4

Statistical source data.

Source Data Extended Data Fig. 5

Statistical source data.

Source Data Extended Data Fig. 6

Statistical source data.

Source Data Extended Data Fig. 7

Statistical source data.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Dmitrijeva, M., Tackmann, J., Matias Rodrigues, J.F. et al. A global survey of prokaryotic genomes reveals the eco-evolutionary pressures driving horizontal gene transfer. Nat Ecol Evol 8, 986–998 (2024). https://doi.org/10.1038/s41559-024-02357-0

Download citation

Received: 14 June 2023
Accepted: 05 February 2024
Published: 05 March 2024
Issue Date: May 2024
DOI: https://doi.org/10.1038/s41559-024-02357-0
Springer Nature Limited

A global survey of prokaryotic genomes reveals the eco-evolutionary pressures driving horizontal gene transfer

Abstract

Similar content being viewed by others

Main

Results and discussion

Extensive contribution of HGT to prokaryote genome evolution

Enrichment of accessory genes in recent transfers

Functional repertoires of recent and old transfer events

Co-occurring species are more likely to transfer genes

Species with high abundance are more likely to transfer genes

Conclusion

Methods

Genome selection and pangenome generation

HGT event detection

Calculating the average fraction of genes transferred

MicrobeAtlas data retrieval

Mapping genomes to MicrobeAtlas database OTUs

Preferred habitat assignment

Gene and species distance normalization

Pangenome analysis

Genome annotation and functional enrichment analysis

Functional repertoire comparison with previous studies

Co-occurrence analysis

Interaction prediction and analysis

Abundance analysis

Generalist and specialist analysis

Data visualization

Reporting summary

Data availability

Code availability

References

Acknowledgements

Funding

Author information

Authors and Affiliations

Contributions

Corresponding authors

Ethics declarations

Competing interests

Peer review

Peer review information

Additional information

Extended data

Extended Data Fig. 4 Functional enrichment within bins depicted in Fig. 2e for remaining COG categories not in Fig. 2f.

Extended Data Fig. 5 Functional enrichment within bins depicted in Fig. 2e for KEGG pathway maps.

Extended Data Fig. 6 Functional enrichment within bins depicted in Fig. 2e for the BRITE hierarchy ‘Antimicrobial resistance genes’.

Supplementary information

Source data

Rights and permissions

About this article

Cite this article

Share this article

Search

Navigation