Background

Ruminants have been an important part of human society for centuries [1, 2], providing us with a variety of economic products including meat, milk, and fur [3]. The unique digestive system of the multi-chambered stomach including the rumen, reticulum, omasum, and abomasum allows them to return the semi-digested food fibers to the mouth for further digestion [4]. This makes them well-suited to grazing on pastures and other types of land that are not suitable for other forms of agriculture [5].

Recent research has shown that the gastrointestinal tract (GIT) of ruminants contains a great diversity of prokaryotic and eukaryotic microorganisms [6,7,8,9]. Because microbial composition varies among GIT locations, the microbiomes at different sites play different roles but perform equally important physiological functions for ruminant survival [10, 11]. The GIT microbes enable ruminants to digest lignocellulose and other plant feedstuffs [12, 13] and protect the animal host from harmful bacteria and other pathogens [14]. However, the GIT microbiome can also have negative effects on ruminant health and productivity. For example, a disruption of the microbial balance in the rumen can lead to the overproduction of lactic acid, which can lower the pH and cause ruminal acidosis, a common metabolic disorder in ruminants [15]. Precise regulation of the GIT microbiome in ruminants is therefore crucial for improving animal health and productivity and for reducing the environmental impact of animal agriculture. Methane-producing archaea in the ruminant GIT are one of the main sources of greenhouse gases and have been targeted for eradication or reduction [16, 17]. However, precisely regulating methane-producing archaea or pathogenic bacteria in ruminants is not a simple task, as it requires a deep understanding of the complex interplay between the microbiome, diet, and host physiology. Currently, there is a lack of systematic tools for the precise manipulation of the ruminant microbiome [18].

Bacteriophages (phages) are a critical component of the ruminant GIT microbiome and play crucial roles in shaping microbial composition [19]. In addition, phages hold great promise for the precise manipulation of the bacteriome (i.e., the bacterial and archaeal microbes) because of their narrow microbial host range (often at the species or even strain level [20, 21]), providing alternative ways to suppress pathogenic bacterial/archaeal species [22] and control methane emissions [16, 23]. The lifestyles of phages can be broadly classified into two categories: lytic and lysogenic. Distinguishing lytic phages is important in practical applications, as lytic phages are typically more convenient to work with and have more immediate applications, such as serving as antimicrobials against bacterial infections in animals. For example, phages have been used to control bacterial infections in dairy cattle with mastitis, a common and costly disease in the dairy industry [24]. Despite tremendous success in identifying viruses from various sources such as the ruminant rumen [25,26,27,28,29,30,31], the human gut [21, 32,33,34,35,36,37,38], aquatic and terrestrial environments, plants, and other mammals (e.g., IMG/VR v3 [39]), the virome of the ruminant GIT remains “dark matter” compared with other environments, especially at sites other than the rumen. A comprehensive reference resource of phage genomes is required to further characterize the viral community of the ruminant GIT and enable genome-resolved research across ruminants.

Here, we present the Unified Ruminant Phage Catalogue (URPC), a comprehensive survey of phages in the gastrointestinal tracts of ruminants. Currently, the URPC contains 64,922 non-redundant phage genomes identified from 2333 bulk metagenomic sequencing samples from 18 published works (Table S1), covering ten gastrointestinal sites from eight different ruminant species. We found that 60.53% (n = 39,300) of the phage genomes were novel compared with those in public viral datasets. We characterized the distributions of the phage genomes in different ruminants and GIT sections, as well as the lifestyles of the phages. Strikingly, ~ 60% of the ruminant phages were lytic, the highest proportion among all environments examined, which should facilitate their application in microbial interventions. To further facilitate future applications of the phages, we also constructed a comprehensive virus-bacteria/archaea interaction network and identified dozens of phages that may have lytic effects on methanogens. Together, our URPC dataset represents a useful resource for future microbial interventions to improve ruminant production and environmental quality.

Methods

Data collection, quality control, and removal of host- and food-associated genomes

To perform a comprehensive search for phages of the ruminant gastrointestinal tract (GIT), publicly available sequencing reads of 2333 ruminant metagenomic samples were downloaded from the National Center for Biotechnology Information/NLM/NIH (NCBI) (Figure S1; Table S1), covering eight ruminants (buffalo, camel, cattle, cow, deer, goat, sheep, yak) and ten GIT sites (rumen, reticulum, omasum, abomasum, duodenum, jejunum, ileum, cecum, colon and rectum/feces) (Table S2). Raw reads were trimmed by Trimmomatic (v 0.39) [40] with the options ‘ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 SLIDINGWINDOW:4:15 MINLEN:50 LEADING:3 TRAILING:3’. To decrease potential DNA contamination from the animal hosts and feed, reads that aligned to the closest available host genomes from NCBI (Capra hircus, GCF_001704415.1; Bubalus bubalis, GCA_004794615.1; Camelus bactrianus, GCF_000767855.1; Camelus dromedarius, GCF_000803125.2; Bos taurus, GCF_002263795.1; Alces alces, GCA_007570765.1; Cervus elaphus, GCF_910594005.1; Rangifer tarandus caribou, GCA_019903745.1; Capreolus capreolus, GCA_000751575.1; Ovis aries, GCF_016772045.1; Hydropotes inermis, GCA_020226075.1; Bos grunniens, GCA_005887515.2) or to food-associated genomes such as Glycine max, Zea mays, and Medicago truncatula were filtered out using Bowtie2 (v 2.3.5.1) [41] with the option ‘--very-sensitive’. The remaining paired reads were then used for further analyses.

Metagenomic assembly and viral contigs prediction

Unless otherwise stated, default parameters were used. Each sample was assembled using MEGAHIT (v 1.2.8) [42] with the option ‘--min-contig-len 1000’. Assembled contigs of ≥ 1.5 kb in size were screened for viral sequences using VirSorter2 (v 2.1) [43] with the options ‘--include-groups "dsDNAphage,ssDNA" --min-score 0.7’ and VirFinder (v 1.1) [44] with default parameters. Contigs identified as viral by both VirSorter2 and VirFinder (score ≥ 0.6 and p < 0.05) were retained as putative phages.
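
The consensus rule described above can be illustrated with a short sketch. It is not the authors' script: the `ContigCall` record and its field names are hypothetical, and we assume that the 0.7 cutoff applies to the VirSorter2 score while the score ≥ 0.6 and p < 0.05 cutoffs apply to VirFinder.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ContigCall:
    """Hypothetical per-contig summary of both predictors."""
    contig_id: str
    length_bp: int
    vs2_score: Optional[float]  # None if VirSorter2 did not report the contig
    vf_score: float             # VirFinder score
    vf_pvalue: float            # VirFinder p value


def is_putative_phage(call: ContigCall,
                      min_len: int = 1500,
                      vs2_min: float = 0.7,
                      vf_min: float = 0.6,
                      vf_max_p: float = 0.05) -> bool:
    """Keep a contig only if it is long enough and both tools call it viral."""
    if call.length_bp < min_len:
        return False
    vs2_ok = call.vs2_score is not None and call.vs2_score >= vs2_min
    vf_ok = call.vf_score >= vf_min and call.vf_pvalue < vf_max_p
    return vs2_ok and vf_ok


calls = [
    ContigCall("c1", 4200, 0.93, 0.81, 0.010),  # passes both tools
    ContigCall("c2", 4200, None, 0.95, 0.001),  # VirFinder only
    ContigCall("c3", 1200, 0.99, 0.99, 0.001),  # too short
]
kept = [c.contig_id for c in calls if is_putative_phage(c)]
print(kept)
```

Requiring agreement between two predictors trades sensitivity for precision, which suits a reference catalogue where false positives are costly.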

Quality evaluation of phage genomes and dereplication of URPC datasets

The completeness of the viral contigs was estimated using CheckV (v 0.8.1) [45]. A total of 74,519 viral contigs with > 50% completeness were then selected and renamed according to their animal hosts. The sequences of these contigs were merged into a single file and dereplicated using CD-HIT [46] (v4.8.1, parameters: -c 0.95 -n 8) at a global identity threshold of 95%. The resulting 64,922 non-redundant representative viral genomes, termed viral populations (VPs), are referred to as the Unified Ruminant Phage Catalogue (URPC).

Comparing the URPC genomes with public viral datasets

To estimate the proportion of novel phage genomes in the URPC, BLASTn (v 2.5.0) [47] was used to search all URPC sequences against a set of public viral databases, including four public rumen virome datasets (the rumen virome database (RVD) [31], Hitch et al. [25], Solden et al. [26], and Friedersdorff et al. [27]), the NCBI viral reference genomes (Release 201, July 06, 2020), IMG/VR v3 [39], and four public human gut virome datasets, namely GVD [33], GPD [32], MGV [21], and CHGV [34] (Table S3).

Average nucleotide identity (ANI) was calculated by merging the BLASTn hit regions with identity ≥ 90% and hit length ≥ 500 bp, then calculating the coverage of these regions. Based on the overall ANI, a viral sequence was considered novel if it had < 95% ANI to the sequences in these public datasets.
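
The merge-and-coverage step can be sketched as follows. This is an illustrative reading of the description above, not the authors' code: the input layout (query start, query end, percent identity, 1-based inclusive coordinates) and the function names are assumptions.

```python
def merged_coverage(hits, query_len, min_identity=90.0, min_len=500):
    """Fraction of the query covered by the union of qualifying hit regions.

    hits: iterable of (qstart, qend, pident) with 1-based inclusive coordinates.
    """
    intervals = sorted(
        (min(s, e), max(s, e))
        for s, e, pident in hits
        if pident >= min_identity and abs(e - s) + 1 >= min_len
    )
    covered = 0
    cur_start = cur_end = None
    for start, end in intervals:
        if cur_end is None or start > cur_end + 1:
            # Close the previous merged block and open a new one.
            if cur_end is not None:
                covered += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent: extend the current block.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start + 1
    return covered / query_len


def is_novel(coverage, threshold=0.95):
    """A genome is called novel when the merged coverage is below 95%."""
    return coverage < threshold


example_hits = [
    (1, 600, 95.0),      # kept
    (501, 1000, 92.0),   # kept, overlaps the first hit
    (1200, 1300, 99.0),  # dropped: shorter than 500 bp
    (2000, 2600, 85.0),  # dropped: identity below 90%
]
cov = merged_coverage(example_hits, query_len=2000)
print(cov, is_novel(cov))
```

Merging before summing prevents overlapping high-scoring pairs from being counted twice.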

Clustering viral contigs into viral clusters (VCs)

The clustering of viral contigs into viral clusters (VCs) was performed using a strategy adapted from the GPD [32]. Briefly, BLASTn with default parameters was used to search the nucleotide sequences of the URPC genomes and the environmental viral sequences from various habitats (e.g., terrestrial, freshwater, and plants) in the IMG/VR v3 database against themselves for homologous sequences. An E value threshold of 1E−10 was first used to filter the BLASTn results; the BLASTn query-hit pairs were further filtered to retain those with a coverage > 70% on the larger genome and a coverage > 90% on the smaller genome. Here, the coverage was calculated by merging the aligned lengths of the BLASTn high-scoring pairs (HSPs) that shared at least 90% nucleotide similarity. Finally, the filtered BLASTn results were used as input for graph-based clustering with the Markov clustering algorithm (MCL v14-137) [48] at an inflation value of 6.0, which grouped the viral contigs into 55,635 VCs.
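
The pair-filtering rule that precedes the Markov clustering can be written compactly. The sketch below is an assumption-laden illustration: the function name and arguments are hypothetical, coverages are expressed as fractions, and we assume that pairs with an E value at or below 1E−10 pass the first filter.

```python
def keep_pair(len_a, len_b, cov_a, cov_b, evalue,
              max_evalue=1e-10, min_cov_larger=0.70, min_cov_smaller=0.90):
    """Decide whether a BLASTn genome pair becomes an edge in the cluster graph.

    len_a, len_b: genome lengths in bp.
    cov_a, cov_b: merged HSP coverage (0-1) on genome a and genome b.
    """
    if evalue > max_evalue:
        return False
    if len_a >= len_b:
        cov_larger, cov_smaller = cov_a, cov_b
    else:
        cov_larger, cov_smaller = cov_b, cov_a
    return cov_larger > min_cov_larger and cov_smaller > min_cov_smaller


pairs = [
    ("vp1", "vp2", 40000, 35000, 0.75, 0.93, 1e-50),  # kept
    ("vp1", "vp3", 40000, 10000, 0.20, 0.95, 1e-50),  # larger genome poorly covered
    ("vp2", "vp4", 35000, 36000, 0.95, 0.72, 1e-5),   # E value too high
]
edges = [(a, b) for a, b, la, lb, ca, cb, ev in pairs
         if keep_pair(la, lb, ca, cb, ev)]
print(edges)
```

Using a stricter threshold on the smaller genome keeps short fragments from linking unrelated large genomes.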

Prediction of viral lifestyles

The lifestyles of all the URPC genomes were predicted using DeePhage v1.0 [49] with default parameters. DeePhage uses a scoring system to classify phage genomes into four categories: temperate (scores ≤ 0.3), uncertain temperate (0.3 ~ 0.5), uncertain virulent (0.5 ~ 0.7), and virulent (> 0.7). Higher scores indicate a higher likelihood of a virulent lifestyle. According to a benchmark study [50], DeePhage can classify short contigs from metagenomic data and has the best-reported performance on lifestyle prediction, whereas BACPHLIP [51] is designed only for complete phage genomes. DeePhage also generalizes better to novel phages because it uses a deep neural network to learn features from both the DNA and protein sequences of phages, whereas BACPHLIP relies on a set of conserved protein domains associated with lysogeny. Therefore, we chose DeePhage to predict the lifestyles of the URPC phages.
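
The four score bins can be expressed as a small helper. This is only a sketch: the function names are ours, and the handling of scores that fall exactly on 0.5 or 0.7 is an assumption, since the text gives the bins as open ranges. Grouping virulent and uncertain virulent phages as lytic follows the usage in the Results.

```python
def lifestyle_category(score):
    """Map a DeePhage score to one of the four lifestyle categories.

    Boundary handling at 0.5 and 0.7 (assigned to the lower bin) is assumed.
    """
    if score <= 0.3:
        return "temperate"
    if score <= 0.5:
        return "uncertain temperate"
    if score <= 0.7:
        return "uncertain virulent"
    return "virulent"


def is_lytic(score):
    """Lytic is taken to mean virulent or uncertain virulent."""
    return lifestyle_category(score) in {"uncertain virulent", "virulent"}


scores = {"vp1": 0.12, "vp2": 0.45, "vp3": 0.62, "vp4": 0.91}
labels = {vp: lifestyle_category(s) for vp, s in scores.items()}
lytic_fraction = sum(is_lytic(s) for s in scores.values()) / len(scores)
print(labels, lytic_fraction)
```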

Taxonomic annotation of the URPC phages

To taxonomically classify the phage contigs, VirusTaxo (https://github.com/omics-lab/VirusTaxo, downloaded on 19th April 2022) [52] was used to compare the nucleotide sequences against those in the prebuilt database of VirusTaxo and assign them to a known viral genus at an entropy index threshold of < 0.5. A Demovir script (https://github.com/feargalr/Demovir; downloaded on 6th January 2022) was then used to predict family and order ranks for the remaining genomes by searching for viral marker genes at the amino acid level.

Co-diversification analysis of phages with their animal hosts

For all VCs that contained phages from three or more animal hosts, a phylogenetic analysis was performed. In total, 80 VCs were selected. First, Prokka v1.13 [53] (--kingdom Phages) was used to annotate the protein-coding genes of the phage genomes. Pan-genome analysis was then carried out for each VC with Roary [54] to identify core genes and create a multiFASTA alignment of the core genes using MAFFT [55]. A phylogenetic tree was built from this alignment using FastTree [56] v2.1.10 with default parameters. All phylogenetic trees were visualized and annotated using iTOL [57]. For each VC, the branch length between every pair of phage genomes was calculated. Two one-tailed Wilcoxon rank sum tests were performed on the branch lengths between phages from the same animal host and those between phages from different animal hosts. The p value for the hypothesis that phages from different animal hosts had longer branch lengths was used to determine whether the phages had significantly co-evolved or not co-evolved with their animal hosts.
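
The comparison of within-host and between-host branch lengths can be sketched in plain Python. This is an illustration only: the data layout and function names are hypothetical, and the p value uses a normal approximation without tie or continuity correction, so it will differ slightly from the values produced by a statistics package such as R.

```python
import math
from itertools import combinations


def split_distances(dist, host_of):
    """Split pairwise branch lengths into same-host and different-host groups.

    dist: dict mapping frozenset({a, b}) -> branch length between genomes a and b.
    host_of: dict mapping genome ID -> animal host.
    """
    same, diff = [], []
    for a, b in combinations(sorted(host_of), 2):
        d = dist[frozenset((a, b))]
        (same if host_of[a] == host_of[b] else diff).append(d)
    return same, diff


def rank_sum_greater(x, y):
    """One-tailed p value for the alternative that x tends to exceed y."""
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    n = len(pooled)
    rank_sum_x = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of ranks i+1 .. j for tied values
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    n1, n2 = len(x), len(y)
    u = rank_sum_x - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - mean_u) / sd_u
    return 0.5 * math.erfc(z / math.sqrt(2.0))


host_of = {"p1": "goat", "p2": "goat", "p3": "deer"}
dist = {frozenset(("p1", "p2")): 0.1,
        frozenset(("p1", "p3")): 0.5,
        frozenset(("p2", "p3")): 0.6}
same, diff = split_distances(dist, host_of)
print(same, diff)

# Toy values: a small p value supports co-diversification with the host.
p_codiv = rank_sum_greater([5, 6, 7, 8], [1, 2, 3, 4])
print(p_codiv)
```

Swapping the two arguments gives the opposite one-tailed test, which corresponds to the second test mentioned above.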

Microbial host analysis of the URPC phages

To find putative microbial hosts for the URPC phages, ruminant metagenome-assembled genomes (MAGs) from four sources were downloaded, including the buffalo GIT [6], ruminant GIT [7], cattle rumen [58], and goat GIT (NCBI SRA database PRJNA723432). In addition, MAGs from the Global Microbial Gene Catalog (GMGC) [59] that covered 14 different habitats were also downloaded. To establish phage-microbial host relationships between these MAGs and the URPC phages, two bioinformatic methods were used: CRISPR-spacer matching and nucleotide sequence similarity searches. The CRISPR spacers of the MAGs were identified using CRT (v 1.2) [60] and MinCED (v 0.4.2, https://github.com/ctSkennerton/minced). The union of the CRISPR spacers was then aligned to the viral populations using BLASTn (v 2.5.0) [47] with the options ‘-word_size 10 -dust no -max_target_seqs 10000’. Matches with ≤ 1 mismatch and an alignment length > 95% of the spacer length were retained. In addition, BLASTn was used to compare the viral populations with the MAGs. A putative phage-host relationship was established if the nucleotide sequences shared > 90% identity over > 500 bp.
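
The two acceptance rules and the way their results can be combined are sketched below. The function names and the (phage, host) pair representation are hypothetical; only the thresholds are taken from the text.

```python
def spacer_hit_ok(mismatches, aln_len, spacer_len):
    """CRISPR-spacer match: at most 1 mismatch over > 95% of the spacer."""
    return mismatches <= 1 and aln_len > 0.95 * spacer_len


def genome_hit_ok(pident, aln_len):
    """Sequence-similarity match: > 90% identity over > 500 bp."""
    return pident > 90.0 and aln_len > 500


def combine_predictions(crispr_pairs, blast_pairs):
    """Return all (phage, host) pairs and those supported by both methods."""
    crispr_pairs, blast_pairs = set(crispr_pairs), set(blast_pairs)
    return {"either": crispr_pairs | blast_pairs,
            "both": crispr_pairs & blast_pairs}


crispr = {("vp1", "MAG_a"), ("vp2", "MAG_b")}
blast = {("vp1", "MAG_a"), ("vp3", "MAG_c")}
combined = combine_predictions(crispr, blast)
print(sorted(combined["both"]), len(combined["either"]))
```

Pairs supported by both methods form the higher-confidence subset, while the union maximizes the number of phages with at least one predicted host.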

Phylogenetic analysis of animal hosts

The phylogenetic analysis of the eight ruminant species was carried out using a method based on a previous study [61]. Briefly, the genomic sequences of the eight ruminant species were downloaded from the NCBI Genome database. Then, the universally conserved single-copy marker genes from each genome were identified using fetchMG [62]. The protein sequences of the markers were then aligned using MUSCLE [63] (-maxiters 100). To eliminate divergent regions from the resulting multiFASTA alignment, Gblocks [64] was used (parameters: -t=p -b3=8 -b4=2 -b5=h). The maximum likelihood trees were built with RAxML [65] with default parameters.

Statistical analysis

All statistical analyses were conducted using R (v4.0.4) with a two-sided Wilcoxon rank sum test unless otherwise stated.

Results

A unified catalog of 64,922 phage genomes from the ruminant gastrointestinal tract

To provide a comprehensive overview of the phages associated with the gastrointestinal tract (GIT) of ruminants, we collected a total of 2333 metagenomic samples from 18 previously published studies [6, 7, 13, 23, 26, 58, 66,67,68,69,70,71,72,73,74,75,76,77,78] (Table S1) that covered ten GIT sites from eight ruminant species (Fig. 1A and Tables S1 and S2), including buffalo (n = 745), cattle (n = 930), goat (n = 563), sheep (n = 133), deer (n = 115), yak (n = 50) and cow (n = 46). After quality filtering and removal of host DNA sequences (“Methods” section), a total of 14.17 terabytes (Tb) of clean data were retained, with more than 33 million reads and 9 billion bases per sample (Table S1). We assembled them into a total of 302,721,852 contigs using MEGAHIT [42], averaging 132,251 contigs per sample with an N50 length of 3836 bp (Table S4).

Fig. 1

Reconstruction of the phage genomes from the ruminant gastrointestinal tract (GIT). A Generation of the Unified Ruminant Phage Catalogue (URPC) using 2333 GIT microbiome samples from ten GIT sites and eight ruminant species. The upper-left panel shows a graphical representation of the ruminant GIT, with arrows indicating the direction of food flow through the stomach. The GIT sites in this study are divided into ten sections. The bottom-left panel shows the number of samples taken from the GIT sites or sections of the ruminants. The top-right panel shows the rarefaction analysis of the number of unique VPs (Y-axis) as a function of collected samples (X-axis), while the bottom-right panel shows the statistics of the identified phages from each of the eight ruminant species, including the number, genome size and taxonomy. B Pie chart showing the distribution of the VPs in the URPC across the quality tiers estimated by CheckV (complete, n = 6035; high-quality, n = 3085; medium-quality, n = 55,802). Column chart showing the quality distribution of VPs in each animal host. C Pie chart showing the proportion of annotated VPs in the URPC at the family level using VirusTaxo and Demovir (see “Methods” section)

To identify putative phage genomes, we screened the assembled contigs using a bioinformatics pipeline adapted from Luis et al. [32] (“Methods” section), followed by assessment of viral genome completeness using CheckV [45] and dereplication using CD-HIT [46]. We obtained a total of 74,519 viral contigs (mostly bacteriophages) with > 50% completeness and length > 1.5 kb, corresponding to 64,922 non-redundant viral populations (VPs), i.e., species-level clusters at an Average Nucleotide Identity (ANI) of 95%. We defined the latter (i.e., the 64,922 non-redundant VPs) as the Unified Ruminant Phage Catalogue (URPC). Among these, 6035 (9.29%), 3085 (4.75%), and 55,802 (85.95%) were classified as complete, high-quality, and medium-quality, respectively, according to CheckV (Fig. 1B; Table S5).

Previous studies have commonly employed a 5-kb threshold for identifying metagenome-based viral genomes [31, 39]. In our study, we opted for a 1.5-kb threshold. To substantiate this choice, we categorized all URPC phages into four length groups: < 5 kb (n = 777), 5 ~ 30 kb (n = 14,462), 30 ~ 60 kb (n = 35,199), and > 60 kb (n = 14,484). We first compared the quality of the phages in each group and found that the < 5 kb group had the highest proportion of complete phage genomes (29.81%), as determined by CheckV, compared with the other three groups (Figure S2A). Moreover, taxonomic annotation using our methods was successful for 91.48% of the phages in the < 5 kb group, surpassing the rates observed in all other groups (Figure S2B). Therefore, our findings indicate that including short contigs (i.e., 1.5 ~ 5 kb) not only helps to estimate the number of phages more accurately but also, surprisingly, improves the annotation rate.

Rarefaction analysis showed that the accumulation curve is far from reaching a plateau, indicating that more samples are required to discover the full diversity of ruminant GIT phages (Fig. 1A). Similar trends were observed in human gut virome catalogs such as the metagenomic gut virus (MGV) catalog [21] and a phage genome catalog of the Japanese population [20]. Among all the animal hosts, we obtained the highest number of VPs (n = 33,156) in the buffalo, followed by the cattle (n = 10,589), goat (n = 8756), and sheep (n = 5876). The number of viruses identified varied among GIT sites (Figure S3) and correlated with the number of samples we collected. The genome sizes and viral taxa varied among animal hosts, indicating species-specific viral composition. Given the recent interest in the human gut phageome, we then compared the genome lengths of the URPC with those of other published metagenome-assembled human gut viral genomes and found that URPC genomes were significantly longer than those in the human gut (p < 2.22e−16, Wilcoxon rank sum test; Figure S4).

We annotated the VPs using VirusTaxo [52] and Demovir (https://github.com/feargalr/Demovir) (Fig. S5; Table S5) and assigned 74.69% of the VPs at the family level (Fig. 1C). Among the annotated VPs, 16,507 (25.42% of the total) belonged to the Siphoviridae, followed by Poxviridae (n = 6327), Mimiviridae (n = 5409), Baculoviridae (n = 3962), Myoviridae (n = 3846), Podoviridae (n = 2360) and Microviridae (n = 2291). The overall taxonomic distribution, dominated by viral families such as Siphoviridae, Microviridae, Myoviridae, and Podoviridae, was consistent with other metagenome-derived viral catalogs of the ruminant rumen (RVD) [31] and the human gut [21, 33]. In particular, we reannotated the viral genomes from the RVD using our taxonomic classification pipeline (see “Methods” section) and found that the families Siphoviridae, Podoviridae, Myoviridae, and Baculoviridae accounted for the majority of the viral genomes in both the URPC and RVD datasets. However, we also identified more phages from the family Podoviridae than the RVD (3.6% in URPC vs. 0.5% in RVD) (Figure S6; Table S6), indicating that the URPC expands the diversity of the ruminant gastrointestinal phage genomes.

We then examined the novelty of the URPC phage genomes by comparing them with several public viral databases, including the NCBI viral reference genomes (Release 201, Jul 06, 2020), IMG/VR v3 [39], four public rumen virome datasets [25,26,27, 31] and four human gut virome genome catalogs [21, 32,33,34] (Table S3). Applying an Average Nucleotide Identity (ANI) threshold of 95%, we observed that the URPC shared the highest number of viral populations (VPs) with the RVD (Figure S7A). Notably, 28.11% of URPC genomes were found in the RVD, while 33.84% of RVD genomes were identified in the URPC (Figure S7A). The substantial overlap between the two datasets can be attributed to the similar number of rumen samples used in the URPC (826) and the RVD (975), despite the different tools and criteria used for viral contig identification in the latter [31] (refer to Table S7 for a detailed comparison). For a fair comparison, only the 41,738 VPs from the RVD dataset that met the same criteria as our dataset (i.e., completeness > 50%) were considered. Under these criteria, this study identified a substantially higher number of VPs (64,145) than the RVD (Table S7).

Furthermore, with the inclusion of three additional public rumen phage datasets, a total of 46,668 (71.89%) URPC phages were determined to be novel at a 95% ANI threshold (Figure S7A), signifying URPC’s substantial contribution to expanding the ruminant gastrointestinal tract phage dataset despite prior outstanding works. When considering all the aforementioned public viral datasets, we found that 60.53% (n = 39,300) of VPs were considered novel at the 95% ANI threshold, indicating that the majority of URPC phages are novel (Figure S7B).

Organism-specific distribution of URPC genomes in animal hosts

To investigate the correlation between the composition of VPs and their animal hosts, we first calculated the distribution of the VPs in each animal host. We discovered that 99.91% (n = 64,863) of VPs had only one animal host (referred to as organism-specific from now on), while only a few (n = 59) appeared in two or three animal hosts (Fig. 2A). To evaluate the distribution of phages among their animal hosts at a higher level, we clustered the VPs into viral clusters (VCs) using methods adapted from the GPD [32] (“Methods” section) and generated a total of 55,635 VCs. Among these, 99.06% (n = 55,122) of the VCs were organism-specific. Similarly, most (91.43%, n = 50,874) of the VCs were found in only one GIT site (Fig. 2B). Among the 4761 VCs that were distributed in two or more GIT sites, 92.69% (n = 4413) came from the same animal host, indicating an organism-specific distribution of the phages in the animal hosts. Among the 80 “broad-range” VCs (present in three or more animal hosts), most of the animal hosts were goats (n = 60), buffaloes (n = 55), cattle (n = 52), sheep (n = 51), and deer (n = 39), while fewer were yaks (n = 4), camels (n = 2), and cows (n = 1) (Fig. 2C), which might be due to the smaller number of available GIT samples from the latter three animals. To find out whether these “broad-range” VCs were food-related, we also included the VPs from the IMG/VR v3 database and repeated the viral clustering. We found that two of the “broad-range” VCs clustered with phages found in terrestrial, freshwater, and plant habitats (Fig. 2C), which is consistent with previous research showing that phages can be readily introduced into the rumen from water sources, as well as housing and farm infrastructure [19]. However, the origins of the other 78 broad-range VCs remain to be identified.

Fig. 2

Distributions of the URPC phages in animal hosts and GIT sites. A The distribution of the URPC phages at the viral population (VP) level across the animal hosts. The UpSet plot shows the numbers of unique and shared VPs for the eight ruminant animals, while the bar chart shows the number of animal hosts for the VPs. B The distributions of the URPC phages at the viral cluster (VC) level across the animal hosts (left) and GIT sites (right). C An UpSet plot showing the overlaps between the broad-range phages (i.e., those that were found in two or more ruminant species) and the phage genomes collected in the IMG/VR v3 database (“Methods” section). D Top 50 most diverse VCs, ordered by their cluster size. The color and size of the VCs correspond to the number of animal hosts in which they were found. E Characterizations of the top 10 VCs, including the size, distributions in the GIT sites and animal hosts, genome size, and lifestyle. Lifestyles of the phages in each VC were predicted by DeePhage, which classified phages into four groups, including virulent (red), uncertain virulent (pink), uncertain temperate (light blue), and temperate (dark blue). Dark green and light green respectively indicate whether VCs are found in the public datasets

We next characterized the largest VCs (i.e., the VCs were ranked according to the number of VPs they contained). Twenty-seven of the top 50 had two or more animal hosts, suggesting that the most diverse VCs were also the best adapted; this trend was more apparent among the top 10 VCs, eight of which were found in two or more animal hosts. In addition, we observed that most of the top 10 VCs consisted of lytic bacteriophages (Fig. 2E; lifestyles were determined using DeePhage; “Methods” section). Most of the phages in the top 10 VCs were 30 ~ 60 kb in size, well within the size range of typical phage genomes; in contrast, the most diverse groups in the human gut virome were crAssphages and Gubaphages, whose genomes are considerably longer (~ 100 kb in size) [32, 79]. Interestingly, the majority of broad-range phages in the top 10 VCs were identified in the rumen (63%; Table S5), a much higher proportion than in the rectum (including fecal samples; 17%), even though we had comparable numbers of samples from the two GIT sites (826 vs. 753, Fig. 1A). These results suggest that at least some of the rumen phages likely originated from the environment.

In summary, we found that most of the ruminant phages are organism- and GIT site-specific at both the VP and VC levels, with a few broad-range ones, likely originating from the environment.

Co-diversification of phages with their animal hosts

We next investigated whether the phage genomes in the broad-range VCs showed co-diversification patterns with their animal hosts. We focused on the 80 broad-range VCs that were present in three or more animal hosts; for each of the VCs, we used a phylogenetic tree-based method to test whether phages from the same animal host were significantly closer than those from different animal hosts (“Methods” section). We observed that in 83.75% (n = 67) of the VCs, the phages tended to significantly co-evolve with their animal hosts (i.e., VPs from the same animal host in a VC clustered together on the evolutionary tree and had significantly shorter evolutionary distances), while only 2.5% were classified as significantly not co-evolving (Fig. 3A). Fig. 3B–F shows a few typical examples from our analysis. As shown in Fig. 3B, we identified two deer phages in VC_1167 that showed closer phylogenetic relationships to phages of other ruminants than to each other, indicating significant non-co-evolution. Conversely, Fig. 3C–F shows a few cases of significant co-evolution in VC_1341, VC_1220, VC_35, and VC_95, in which multiple phages from the same animal hosts often cluster together in their respective phylogenetic trees. Further efforts will be required to determine whether the co-evolution was due to the adaptation of the microbial hosts of the phages to the ruminant species.

Fig. 3

Co-evolution analysis of the broad-range VCs with the animal hosts. A Overall statistics of the co-evolution analysis. The density and bar plots show the distributions of the p values for the hypotheses that the phages in the 80 broad-range VCs co-evolved (red line and bars) or did not co-evolve (orange line and bars) with their animal hosts. One-tailed Wilcoxon rank sum tests were performed on the branch lengths from the same animal hosts and different animal hosts (“Methods” section). The pie chart shows the proportion of co-evolved and non-co-evolved viral clusters (VCs) with three or more animal hosts. B–F Example phylogenetic trees of VCs with their animal hosts in which the phages showed significant non-co-evolution (B) or co-evolution (C–F)

The URPC contains the highest proportion of lytic phages as compared with other environments

The observation that lytic phages accounted for seven of the top 10 VCs encouraged us to further characterize the lifestyles of the ruminant phages. Because the metagenomic samples we collected were not enriched for viral particles, we expected that many phages would be derived from bacterial cells and would therefore more likely be temperate phages integrated into bacterial chromosomes. However, we found that the majority (59.60%, n = 38,696) of the phages were classified as lytic (virulent or uncertain virulent) by DeePhage [49], which has been reported to outperform several existing tools in terms of accuracy. We found similar proportions of lytic phages in the eight ruminants (55.15 to 88.59%, with an interquartile range (IQR) of 55.77 to 61.86%; Fig. 4A) as well as across GIT sites (53.76 to 61.51%, with an IQR of 57.35 to 59.08%; Figure S8). Similarly, we identified comparable percentages of lytic phages in two rumen viral genome datasets that contained at least 200 phage genomes (48.55% in the RVD and 52.43% in the moose rumen [26]; Table S3), further supporting our findings.

Fig. 4

Lifestyle analysis of the phages identified in the ruminant GIT and other environments. A Phage lifestyle analysis of the ruminant GIT, rumen, human gut, and other habitats in the IMG/VR v3 database. Because VirSorter1 [80] was previously used for viral identification in the moose rumen dataset [26], we reannotated its viral genomes using our viral identification pipeline (i.e., VirSorter2, VirFinder and CheckV; see “Methods” section); this dataset is marked with *. DeePhage was used to analyze the phage lifestyles, classifying phages into four groups, including virulent (red), uncertain virulent (pink), uncertain temperate (light blue), and temperate (dark blue). The proportions next to the bar plots indicate the overall proportion of lytic phages (i.e., the virulent and uncertain virulent combined) in each dataset, while the numbers in parentheses indicate the overall numbers of phages in the corresponding datasets that passed our filtering criteria (i.e., CheckV completeness > 50% and length > 1.5 kb). B Comparisons of the phage lifestyles among the ruminants, Tanzania hunters, and the combination of public human gut virome datasets including the GPD, GVD, MGV, and CHGV. P values were calculated using the chi-square test

We then compared the URPC with public datasets and, surprisingly, found that phages from all other environments had lower proportions of lytic phages. Since the length and completeness of a viral genome affect the number of genes detected in it, we applied the same quality control as for the URPC (length > 1.5 kb and completeness > 50% estimated using CheckV [45]) to the viral genomes from public databases (Table S3). For example, 44.40% of phages in IMG/VR v3, the most comprehensive phage database so far, are lytic, which is significantly lower than in the URPC. We obtained the same result when stratifying the IMG/VR phages by habitat (Fig. 4A). In addition, we found an overall proportion of 32% lytic phages in several human gut virome datasets (24.02 to 40.81%, interquartile range (IQR): 28.00 to 38.03%; Table S3), consistent with previous observations that human gut phages are mostly temperate [20, 32].

Interestingly, of all the human gut virome/metagenome datasets we analyzed, only one contained a proportion of lytic phages similar to that of the URPC: the Tanzania hunter gut metagenome. As shown in Fig. 4B, 57.5% of the phages identified in the Tanzania hunter dataset were lytic, similar to the URPC (p = 0.61, chi-square test), while both differed significantly from the other human gut virome datasets (p < 0.01, chi-square test). We speculate that lifestyle differences between the Tanzania hunters and the other human cohorts, such as the hunters’ consumption of raw or less processed foods and high exposure to microbe-enriched environments [81], might underlie the different phage lifestyles. However, because of the limited numbers of samples (i.e., 40 metagenomic samples from the NCBI SRA database; PRJNA392180) and identified viral contigs (i.e., 40 non-redundant viral contigs with completeness above 50%), this hypothesis should be tested using larger datasets.
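For readers who want to see how such a comparison of proportions works, the sketch below computes a Pearson chi-square test on a 2 × 2 table of lytic versus non-lytic counts. It is an illustration only: the text does not say whether a continuity correction was applied (none is used here), the Tanzania counts are derived from the reported 57.5% of 40 contigs (23 lytic, 17 not), and the second row uses placeholder counts, not the URPC values.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test (no continuity correction) for the table
    [[a, b], [c, d]]. Returns (statistic, p_value) with 1 degree of freedom."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column needs at least one count")
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # For 1 degree of freedom, the chi-square survival function is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Row 1: Tanzania hunter dataset, 57.5% of 40 viral contigs = 23 lytic, 17 non-lytic.
tanzania_lytic, tanzania_other = 23, 17
# Row 2: placeholder counts chosen for illustration; NOT the real URPC numbers.
reference_lytic, reference_other = 600, 400

stat, p_value = chi_square_2x2(
    tanzania_lytic, tanzania_other, reference_lytic, reference_other
)
print(f"chi-square = {stat:.3f}, p = {p_value:.3f}")
```

With only 40 contigs in one row, even fairly large differences in proportion yield a small test statistic, which is one reason the small sample size limits the conclusions that can be drawn.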

Bacterial and archaeal host prediction of the URPC phages identifies dozens of lytic phages targeting methane-producers

Predicting the hosts of viruses is crucial for understanding viral roles and impacts [82], and phages can serve as ideal tools to regulate ruminant GIT ecosystems by limiting the number of their microbial hosts through lytic infections [83]. We thus predicted hosts for all the VPs against metagenome-assembled genomes (MAGs) from public datasets [6, 7, 58, 59] using two different methods, namely the CRISPR-spacer-based and sequence similarity-based methods (“Methods” section). We were able to assign a total of 9271 phages (14.28% of the total) to their putative bacterial/archaeal hosts, including 4690 (50.59%) to a public ruminant GIT genome collection and 5562 (59.99%) to the MAGs in the Global Microbial Gene Catalog (GMGC [59]). Among these, 754 phages were assigned consistently to the same hosts by both methods (i.e., the high-confidence predictions; Fig. 5A, Table S8). We observed little overlap between the virus-host pairs predicted by the two methods, consistent with previous results [84, 85]. In total, 7227 (77.95%) phages were classified as specialists (Fig. 5B), meaning that they were predicted to infect only one genus, while the others were predicted to infect two or more genera (i.e., generalist phages); this is consistent with previous findings that phages have a limited host range [20, 86, 87].
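The bookkeeping behind these numbers can be sketched in a few lines. The example below assumes host predictions are available as (phage, host genus, method) tuples; that layout, the method labels `"crispr"` and `"blastn"`, and the toy data are hypothetical. It also simplifies by comparing the two methods at the genus level, which may differ from the level at which the authors defined consistent assignments.

```python
from collections import defaultdict


def host_ranges(predictions):
    """Label each phage 'specialist' (one predicted host genus) or
    'generalist' (two or more predicted host genera)."""
    genera = defaultdict(set)
    for phage, genus, _method in predictions:
        genera[phage].add(genus)
    return {
        phage: "specialist" if len(found) == 1 else "generalist"
        for phage, found in genera.items()
    }


def consensus_pairs(predictions, method_a="crispr", method_b="blastn"):
    """Return the (phage, genus) pairs predicted by both methods."""
    by_method = defaultdict(set)
    for phage, genus, method in predictions:
        by_method[method].add((phage, genus))
    return by_method[method_a] & by_method[method_b]


# Toy predictions for illustration only; not data from the study.
toy_predictions = [
    ("phage_1", "Prevotella", "crispr"),
    ("phage_1", "Prevotella", "blastn"),
    ("phage_2", "Bacteroides", "blastn"),
    ("phage_2", "Roseburia", "blastn"),
    ("phage_3", "Methanobrevibacter", "crispr"),
]

ranges = host_ranges(toy_predictions)
confident = consensus_pairs(toy_predictions)
print(ranges)
print(confident)
```

Pooling predictions from both methods before counting genera mirrors the idea that a phage is a generalist if any evidence links it to more than one genus; a stricter analysis could count genera per method instead.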

Fig. 5

Host prediction of the ruminant GIT phages and identification of lytic phages targeting methane producers. A Statistics on the viral-microbial host relationships predicted using two different methods, namely the CRISPR-spacer-based and sequence homology (blastn)-based methods. The UpSet plot shows the number of unique and shared viral-host interactions according to the two methods. The pie chart shows the proportion of phages whose host(s) could be predicted by these methods. B Histogram showing the number of phages (Y-axis) as a function of the number of predicted hosts at the genus level (X-axis). The phages were divided into specialists (number of host genera = 1) and generalists (number of host genera > 1). C Characteristics of the phages stratified by their predicted microbial hosts at the phylum level, including the genome size, annotation rate, host specificity, and lifestyles. The lifestyles were predicted using DeePhage and classified into two groups (virulent: DeePhage score ≥ 0.5; temperate: score < 0.5). D The interaction network between phages and methane producers (i.e., archaea) based on the predicted phage-host relationships

Among all the predicted virus-microbial host relationships, Firmicutes (the combination of Firmicutes_A, Firmicutes, and Firmicutes_C) was the phylum most commonly targeted by the URPC phages (n = 5214, i.e., the number of interacting phages), followed by Bacteroidetes (n = 3554) (Fig. 5C); these are the two groups of beneficial bacteria that dominate the ruminant GIT [88]. Many functionally important genera were targeted by the phages. At the genus level, the most frequently predicted host was Prevotella (n = 1035), one of the most abundant and versatile genera, which contributes to hemicellulose degradation, lignocellulose pretreatment, and feruloyl esterase activity [89]. It was followed by the cellulose-digesting Bacteroides (n = 625), which secrete cellulases and hemicellulases to degrade cellulose and hemicellulose into glucose and other sugars for ruminants [12, 90], and by Lachnospira (n = 524) and Roseburia (n = 381), major producers of short-chain fatty acids (SCFAs) in the rumen that provide energy and anti-inflammatory effects [12, 91, 92]. Our results suggest important regulatory roles of phages in the microbial structures and functions of the ruminant GIT.

Phages could be an ideal tool to inhibit the growth of methane-producing archaea in the GIT of ruminants [93]. However, no lytic phages targeting methane producers had previously been identified [16]. Here, we retrieved 109 phages predicted to infect methanogenic archaea from the phage-microbial host analysis (Fig. 5D). Of these, 74 were lytic (virulent or uncertain virulent; “Methods” section) and targeted six genera of methanogens (i.e., ISO4, Methanobrevibacter, Methanobrevibacter_A, Methanobrevibacter_B, Methanocorpusculum, Methanosphaera) annotated by GTDB-Tk [94]. These results should facilitate targeted isolation of phages and experimental validation of their lysis efficiency against methanogenic archaea.
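Selecting such candidates amounts to intersecting two annotations per phage: its predicted host genera and its predicted lifestyle. The sketch below illustrates that selection under stated assumptions: the genus names come from the text, but the input dictionaries, the label spellings (`"virulent"`, `"uncertain_virulent"`), and the toy data are hypothetical and do not reflect the authors' file formats.

```python
# Methanogen genera listed in the text (GTDB-style names).
METHANOGEN_GENERA = {
    "ISO4",
    "Methanobrevibacter",
    "Methanobrevibacter_A",
    "Methanobrevibacter_B",
    "Methanocorpusculum",
    "Methanosphaera",
}

# Assumption: lifestyle labels are spelled like this in the input.
LYTIC_LABELS = {"virulent", "uncertain_virulent"}


def lytic_methanogen_phages(host_genera, lifestyle):
    """Return {phage: targeted methanogen genera} for phages that are
    predicted to be lytic and to infect at least one methanogen genus."""
    hits = {}
    for phage, genera in host_genera.items():
        targets = set(genera) & METHANOGEN_GENERA
        if targets and lifestyle.get(phage) in LYTIC_LABELS:
            hits[phage] = targets
    return hits


# Toy annotations for illustration only; not data from the study.
toy_hosts = {
    "phage_1": {"Methanobrevibacter"},
    "phage_2": {"Methanosphaera", "Prevotella"},
    "phage_3": {"Methanocorpusculum"},
    "phage_4": {"Roseburia"},
}
toy_lifestyle = {
    "phage_1": "virulent",
    "phage_2": "uncertain_virulent",
    "phage_3": "temperate",
    "phage_4": "virulent",
}

candidates = lytic_methanogen_phages(toy_hosts, toy_lifestyle)
print(candidates)
```

Phages without a lifestyle annotation are excluded by design, since `lifestyle.get` returns `None` for them; a real analysis would need to decide how to treat such cases.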

Discussion

Many ruminant animals are important livestock and have more complicated gastrointestinal tracts (GITs) than other mammals. It is well established that the GIT microbiome plays important roles not only in feedstuff digestion and absorption [12, 14], but also in the development, health, and disease of the animals, as well as in the quality of animal products such as meat, milk, and fur [3]. So far, there has been significant progress in the study of ruminant microbiomes, particularly bacteria/archaea [6,7,8]. However, we still lack systematic tools to precisely manipulate the microbiomes to improve the wellness of the animals and the quality of their products. Phages (bacteriophages and archaeal viruses), especially lytic ones, are ideal tools for such purposes because of their abundance in nature and high microbial host specificity [93]. However, there is still a lack of comprehensive research on ruminant phages, especially at GIT sites other than the rumen [25,26,27,28,29,30,31]. In this study, we filled this gap by mining 2333 metagenome samples from eight ruminant species, covering all major sites along the GIT (ten sites, including the rumen). Based on these data, we constructed a Unified Ruminant Phage Catalogue (URPC) comprising 64,922 phage genomes. Of these, 60.53% were novel compared with public virome databases, indicating that the URPC represents a significant expansion of the known ruminant GIT phages and is the most comprehensive dataset so far.

We first examined the distribution of the URPC phages in the eight ruminants and across different GIT sites. Broad-range phages, i.e., those found in multiple animal hosts, are of higher value because they could be applied to multiple animals, e.g., to kill pathogens. However, we found that most phages were organism-specific, which was expected given that the rarefaction curve was far from saturation (Fig. 1A) and is consistent with previous observations in humans that the gut virome is often individual-specific [20, 32]. Nevertheless, these results also indicate that there is a much larger arsenal from which we can find phages targeting specific bacterial/archaeal species of interest.

Lytic phages often have higher application potential because they are easier to isolate and more efficient in killing their microbial hosts. Surprisingly, we found that ~ 60% of the URPC phages are lytic, a higher proportion than in any other environment we surveyed, including terrestrial, marine, aquatic, freshwater, plant, and human gut environments (Fig. 4A). Moreover, we observed a similarly elevated proportion of lytic phages in two rumen viral genome datasets (RVD and moose rumen [26]). Lytic phages are often isolated from sewage [22]; our results thus suggest that the ruminant GIT is a promising alternative source for lytic phage isolation.

To further facilitate future applications of the URPC phages, we predicted their microbial hosts using public MAG datasets, including several ruminant GIT MAG collections and those from other environments. Of particular interest, by mining the phage-host relationships we obtained 109 phages targeting methane-producing species from six archaeal genera; of these, 74 were lytic. Previous studies have shown that phages targeting methanogens may help reduce methane emissions [13], but a large-scale method for identifying such phages has been lacking [95]. Therefore, our results will facilitate the targeted isolation of lytic phages against methanogens and against other bacterial/archaeal species in general.

Overall, our assembly and analysis of the URPC massively expanded the collection of known ruminant GIT phages and paved the way for microbiome interventions to improve ruminant health and environmental quality.

Conclusions

We filled the gap in ruminant viral ecology research by providing a catalog of phage genomes and identifying many lytic viruses that could target methane producers. Our findings provide insights into the phage community of the ruminant GIT and can be used as a starting point for future research on microbiome manipulation in ruminants.