High-throughput identification of viral termini and packaging mechanisms in virome datasets using PhageTermVirome

Garneau, Julian R.; Legrand, Véronique; Marbouty, Martial; Press, Maximilian O.; Vik, Dean R.; Fortier, Louis-Charles; Sullivan, Matthew B.; Bikard, David; Monot, Marc

doi:10.1038/s41598-021-97867-3

High-throughput identification of viral termini and packaging mechanisms in virome datasets using PhageTermVirome

Article
Open access
Published: 15 September 2021

Volume 11, article number 18319, (2021)
Cite this article

Download PDF

You have full access to this open access article

Scientific Reports

High-throughput identification of viral termini and packaging mechanisms in virome datasets using PhageTermVirome

Download PDF

Julian R. Garneau¹^na1,
Véronique Legrand²^na1,
Martial Marbouty³,
Maximilian O. Press⁴,
Dean R. Vik⁵,
Louis-Charles Fortier⁶,
Matthew B. Sullivan⁵,
David Bikard⁷ &
…
Marc Monot¹

3412 Accesses
5 Citations
5 Altmetric
Explore all metrics

An Author Correction to this article was published on 09 May 2022

This article has been updated

Abstract

Viruses that infect bacteria (phages) are increasingly recognized for their importance in diverse ecosystems but identifying and annotating them in large-scale sequence datasets is still challenging. Although efficient scalable virus identification tools are emerging, defining the exact ends (termini) of phage genomes is still particularly difficult. The proper identification of termini is crucial, as it helps in characterizing the packaging mechanism of bacteriophages and provides information on various aspects of phage biology. Here, we introduce PhageTermVirome (PTV) as a tool for the easy and rapid high-throughput determination of phage termini and packaging mechanisms using modern large-scale metagenomics datasets. We successfully tested the PTV algorithm on a mock virome dataset and then used it on two real virome datasets to achieve the rapid identification of more than 100 phage termini and packaging mechanisms, with just a few hours of computing time. Because PTV allows the identification of free fully formed viral particles (by recognition of termini present only in encapsidated DNA), it can also complement other virus identification softwares to predict the true viral origin of contigs in viral metagenomics datasets. PTV is a novel and unique tool for high-throughput characterization of phage genomes, including phage termini identification and characterization of genome packaging mechanisms. This software should help researchers better visualize, map and study the virosphere. PTV is freely available for downloading and installation at https://gitlab.pasteur.fr/vlegrand/ptv.

Gauge your phage: benchmarking of bacteriophage identification tools in metagenomic sequencing data

Article Open access 21 April 2023

viralFlye: assembling viruses and identifying their hosts from long-read metagenomics data

Article Open access 21 February 2022

Discovery of an expansive bacteriophage family that includes the most abundant viruses from the human gut

Article 13 November 2017

Introduction

Viruses play key roles in diverse microbial ecosystems. They are recognized to influence biogeochemical cycling, modulate microbial populations and metabolism, and act as a driving force in gene flow in soil and the oceans^1,2,3,4,5,6. In humans, viruses can also provide a first line of non-host derived immunity⁷ and are associated with health status^8,9,10,11,12. Such discoveries have led to an increasing interest in characterizing viruses and their impact on ecosystems. However, although metagenomic surveys have revealed hundreds of thousands of viral genomes and large genome fragments^{13,14,15,16,17,18}, these new genomes likely only scratch the surface of the viruses existing in these environments^15,18.

A considerable proportion of the reads obtained when sequencing the human gut virome typically has no significant homology to known viral genomes¹⁹. For example, Manrique et al. reported that only a small subset of active phages detected in the gut microbiome could be taxonomically classified and that more than half (≈ 60%) could represent entirely unknown novel bacteriophages. Similarly, datasets from the deeply sequenced Global Ocean Viromes show that only a minimal fraction of reads (≈ 10–20%) can be mapped to large de novo assembled viral reference genome databases, such as GOV1 and GOV2 databases^13,14. The term “viral dark matter” has been used to describe this vast unknown sequence space^19,20,21. Recent efforts to establish viral clusters in gene-sharing networks have revealed structures that will greatly help in virus identification and classification, but a large number of unknown clusters likely remain to be discovered^22,23. Despite the continuous rise in the number of sequences compiled in these databases, researchers will undoubtedly keep finding unknown sequences in abundance for the foreseeable future. It is thus still important to ascertain the viral origin of sequences discovered in metagenome analysis with higher confidence. Phage and virus biologists have therefore focused their efforts on developing orthogonal complementary prediction approaches, such as the analysis of k-mer frequencies and viral genome features, also recently implemented through machine-learning algorithms^24,25.

It has been previously shown that the ends of linear DNA, a hallmark of genomic DNA (gDNA) packaged in the capsid of free phage particles, can be easily detected from shotgun sequencing data if the sequencing library was prepared in a way that preserves such DNA ends (i.e., random fragmentation of DNA prior to library preparation)^26,27. Shotgun sequencing approaches typically rely on the ligation of adapters to randomly fragmented DNA. During this process, DNA fragments with one end that corresponds to natural genome termini will be strongly enriched relative to other DNA fragments that end at random positions due to the random fragmentation of DNA during library preparation. This enables the recognition of termini by analyzing the number of reads that start at each position along a given phage sequence, normalized against the whole sequence coverage. The number of read starts mapping precisely at the natural termini will be over-represented, because it directly correlates with the number of genomes (i.e., viral particles) present in a sample. We previously used this concept and approach to identify the genome termini and characterize packaging mechanisms of individual pure phages²⁷.

Here, we introduce PhageTermVirome (PTV), an easy-to-use bioinformatics software that efficiently identifies critical information such as phage genome termini and packaging mechanisms in modern large-scale viral metagenomics datasets. Due to its ability to detect termini present in capsid-packaged linear DNA, PTV also offers an entirely orthogonal and complementary approach to predict the true viral origin of metagenomic contigs, along with other existing phage prediction tools^{28,29,30,31,32,33,34,35,36}.

Results and discussion

Overview of the PhageTermVirome workflow

The key steps of the PTV analysis workflow are depicted in Fig. 1. They consist of: (i) mapping all viral metagenomic reads on contigs assembled from the same read dataset, (ii) calculation of the starting position coverage (SPC) and identification of positions on the contig for which the SPC is significantly higher, which may represent genome ends (termini) and secondary terminase cutting sites, (iii) if applicable, characterization of the corresponding packaging mechanism by analysis of the coverage patterns near the identified termini, and (iv) aggregation of the prediction results and coverage plots for all contigs submitted to PTV into a single PDF file for visual inspection. A CSV file containing the predictions and statistics for each contig is also provided to users for easier consultation and manipulation of the results in a table format, which can be useful when there is a very large number of contigs. For contigs with successful predictions, PTV also outputs a new fasta file containing the contigs reorganized to start at the predicted termini.

The PTV algorithm was developed to allow the analysis of multiple contigs in a single workflow (i.e., users do not have to re-run the script for each contig they want to analyze). The mandatory inputs into PTV required for analysis are (i) a multifasta (.fasta) file containing the assembled contigs and (ii) a fastq file (.fastq) containing the reads used to assemble the contigs found in the fasta file. When available, the user can also provide the corresponding paired-end fastq read file. Although not required, the use of paired-end reads is recommended, as it has been shown to improve termini and packaging mechanism prediction²⁷.

The PTV algorithm is largely based on the mapping of all reads of a dataset to each contig provided, which can represent a daunting task for ultra-deeply sequenced viromes containing a vast amount of reads and a large number of contigs. Additional options and parameters have thus been added to the PTV scripts to accelerate and scale the process. For example, in addition to the multi-threading option initially implemented in the original PhageTerm software²⁷ (using the command-line option --core), a new multi-machine functionality has also been deployed in PTV (using the command-line option --mm). The multi-machine option can be used to go over the limit of the number of cores available on a machine when processing large datasets and thus accelerate the analysis. It is intended for advanced users who have the possibility to perform analyses on multi-machine computer clusters. As PTV can still rapidly process average-sized virome datasets using only multi-threading options, the multi-machine option becomes more attractive for extra-large datasets (e.g., ≳ 100 million reads with ≳ 1000 contigs). Many factors can influence the required time to complete a PTV analysis, such as the total number of reads, the total number and length of contigs to analyze, the speed of the processors, and the number of cores available for parallelization.

Detection of phage termini and packaging mode in a control mock virome dataset

We first assessed the ability of PTV to correctly identify termini and packaging modes in metagenomics datasets using a mock viral metagenomics dataset. This dataset was built by pooling 1 × 10⁷ paired-end reads from five reference phages, using 2 × 10⁶ paired-end reads from each phage previously sequenced individually²⁷. HK97, T4, T7, P1, and Lambda phage paired-end reads were used, as these phages harbor a diversity of well-described termini and packaging mechanisms. After running the PTV workflow on the mock read dataset and the reference phage genomes, the expected termini positions were identified and corresponding packaging mechanisms successfully defined by PTV for all five reference phages (Fig. 2): Lambda (5′cos), HK97 (3′cos), P1 (headful with pac site), and T7 (direct terminal repeat, DTR). As expected, PTV returned no signal for T4 (headful without precise pac site). It is known that T4 type phages generate no precise or detectable termini, as their DNA is randomly packaged inside the capsid. This type of phage packages full genomes with redundant ends, starting and ending at random positions, generating no observable over-represented sequencing biases after sequencing. Our results on the mock virome dataset suggest that PTV algorithms, coverage analyses, and statistical models are sufficiently robust (assuming sufficient coverage) to accurately detect termini and classify packaging mechanisms when the reads of different phages are merged, as is the case for real virome metagenomic datasets.

Characterization of phage termini and packaging mode in two real virome metagenomics datasets

Given the successful characterization of phage termini and packaging mechanisms in the mock virome dataset, we next applied PTV to two real virome datasets. The first dataset selected to test PTV was a human gut virome previously sequenced and published in the study by Manrique et al.³⁷. The second dataset was generated during the course of this study and also corresponds to a human gut virome (Fig. 3). In addition to analyzing the two virome datasets with PTV, we also performed a phage prediction analysis with three widely used software programs, MetaPhinder³⁰, VirSorter²⁹ and Phaster³¹. Concomitant analysis of the assembled contigs was performed with these software programs and PTV to verify whether our tool performs better on contigs that are typically predicted to be phages. Thus, we produced Venn diagrams showing contigs individually and commonly predicted to be phage by each tool and compared the results with the successful termini PTV predictions.

For the Manrique dataset, PTV predicted termini and the packaging mode for 44 contigs from a total of 155 assembled contigs. Among the 44 contigs, 28 were predicted to be phage by the other softwares, meaning that PTV could determine termini and the packaging mode on contigs (N = 16) that were not declared to be phage by any other prediction tools. Detailed analysis of these 16 contigs showed a diversity of predicted packaging modes (4 cos phages, 10 pac phages, and 2 DTR phages).

In the second human virome dataset, generated for this study, PTV was able to predict termini and the packaging mode for 72 contigs (from a total of 596), among which 50 were also predicted to be phage by the other phage prediction tools. Thus, PTV obtained predictions for a number of contigs (N = 22) in this dataset that were not identified as phage by any of the other tools. The predicted packaging modes for these 22 contigs were also diverse, with 8 cos phages, 8 pac phages, and 6 DTR phages. One finding that arises from this comparative analysis is that each phage prediction software performs unevenly on different datasets and that it is useful, even necessary, to use a combination of different tools to obtain a more complete and precise vision of the landscape of the phages in a virome dataset.

In light of these results, we wished to understand why certain contigs with successful PTV predictions could not be established as phage by any of the three prediction tools used in this study. Thus, the 38 contigs (16 + 22) were annotated with PROKKA using the latest PHASTER phage protein database (as of December 2020). For each contig, if an ORF was not annotated with a known protein in the database, a second annotation was performed using PROKKA’s default bacterial protein database. Proteins of phage origin could be found on most of the contigs but, in many cases, represented a minority of the total predicted ORFs. The high number of hypothetical proteins found on these contigs likely explains, at least partially, why they were not identified as phage by the other prediction tools used here. Another hypothesis is that these contigs may be fragments of a partially assembled phage genome, as is often the case in virome metagenomic datasets³⁸. In such fragmented assemblies, hallmark phage genes required by certain tools to declare the contig as phage may be lacking. However, given sufficient read coverage, PTV is still able to determine termini and packaging modes on a contig, even if a large part of the corresponding phage genome is missing from the assembly. This feature also allows PTV to discriminate between sequences of active phages and those of purely decaying or inactive prophages (which can be present in contaminating bacterial genomes, but which ultimately do not form viral particles packaging linear DNA with termini). It is difficult to confirm the completeness of a phage contig obtained after the sequencing of a virome sample, especially when using short-read sequencing. To date, the most reliable way to ensure full genome recovery is still probably the sequencing of pure cultured phage lysates. Sequencing of virome DNA using long-read technologies also shows great promise in retrieving more complete phage genomes³⁹.

Since only a few phage proteins could be identified after annotation of our contigs for which we obtained a termini and packaging mode prediction, we sought to perform a more in-depth analysis using vConTACT2²³ in order to obtain additional clues about the taxonomy of those contigs (Fig. 4). In this analysis, 12 of 16 contigs from the Manrique dataset clustered with at least one contig from the reference viral databases included here. It also showed that 17 of 22 contigs from the dataset generated for the study also clustered with contigs from the reference viral databases. Further in-depth analyses of the contigs found to be singletons will help determine the nature of these sequences and whether they represent unknown and distant phages or other entities capable of packaging DNA, such as phage-inducible chromosomal islands (PICIs), genomic islands, or other classes of transducing agents⁴⁰.

Limitations and perspectives for PhageTermVirome

The capacity of PTV to detect termini and identify packaging mechanisms depends on the protocol used to prepare the nucleic-acid libraries prior to sequencing. Indeed, the detection of DNA termini can only occur if they are preserved during library preparation. Methods such as tagmentation, used for example in the Nextera kits from Illumina, insert adapters within DNA fragments in a non-random fashion (creating non-natural sequencing biases for read start positions) and are therefore incompatible with the natural bias-detection procedure of PTV. As previously mentioned, to enable data analysis using PTV, the sequencing library must be prepared through the ligation of adapters to DNA that has been randomly fragmented (e.g., Covaris, sonication), or any DNA preparation in which the natural DNA ends have been preserved. (e.g., Illumina TruSeq PCR free and TruSeq Nano, NebNext, Accel-NGS 1S PluS DNA library Kit). Note that the approach is constrained by the sequencing library preparation methods but not the sequencing technology itself. Although only Illumina sequencing was assessed here, other sequencing technologies or approaches, such as SMRT PacBio sequencing, Nanopore sequencing, and the recently developed VirION2 approach⁴³, are theoretically also well suited to work with PTV. These sequencing technologies exhibit higher error rates than short-read sequencing approaches, but this should not strongly affect the ability of PTV to detect termini and determine packaging mechanisms. The first step of the PTV pipeline is based on the perfect alignment of only short seed sequences, of which the length can be adjusted if needed (default is 20-mers at the start of each read).

It is important to keep in mind that the algorithm of PTV is based on read alignments and the detection of over-represented sequences. Thus, the performance of PTV may be affected by very low coverage depth. Users must keep in mind that PTV will make more confident calls for contigs with a reasonable mean coverage. PTV can work at low mean coverage (e.g., 10 ×) in some cases, but often requires higher mean coverage for the unequivocal detection of termini and packaging mechanism (e.g., around 30 ×). Coverage depth can be variable from one contig to another in virome datasets and can be very low for viral particles that are scarce in the sample. PTV can often identify the position of potential termini for contigs with very low coverage, but may have difficulty in properly determining the packaging mechanism. PTV issues a warning for contigs exhibiting very low coverage and invites users to manually inspect the results and consult the coverage-pattern plots to confirm ambiguous results.

For extra-large virome datasets (a massive number of reads and a very large quantity of contigs), PTV analysis may require long processing times, especially if the analysis is being performed with very few computing cores. In this first version of PTV, a multi-threading option is available, as well as a newly implemented multi-machine option. When parallelization is not possible, alternatives can be used to reduce preventable computing time, such as excluding undesired or irrelevant contigs (e.g., very short contigs ≤ 1 kb or other contigs representing non-viral contamination). The minimal contig limit size can be customized in PTV using the “--limit” option (default is 500 bp). However, as previously mentioned, we expect PTV to be useful for termini detection and characterization for entities known to package small non-random DNA fragments (e.g., Dinoroseobacter shibae gene transfer agents can package 4.2 kb dsDNA⁴⁴. Other examples include particles packaging very small linear DNA or RNA fragments, such as satellite DNA or RNA viruses, as well as linear satellite RNA (satRNA). Group 1 large single-stranded satRNAs can package linear RNA of 0.7–1.5 kb and group 2 single-stranded satRNAs typically package linear RNA fragments < 700 bp (21,994,595). Thus, we recommend carefully choosing the metagenomics contigs excluded from the analysis. Another recommendation, also left to the discretion of the user, is to pre-select contigs with a minimal coverage threshold (e.g., reject contigs ≤ 10 × mean coverage), which will reduce the processing time and also help to generate more robust results.

The PhageTerm approach is naturally restricted to the detection of viruses that package linear nucleic acids. In viral metagenomic samples, dsDNA phages of the caudovirales order represent most sequences known to date, but a large number of diverse phages are yet to be discovered. Recent studies have shed light on the previously underestimated prevalence of single-stranded DNA (ssDNA) phages, including Innoviridae and Microviridae phages^45,46,47, but the ssDNA viruses discovered thus far appear to package circular DNA and therefore escape detection and characterization by PTV. As certain eukaryotic viruses can package linear ssDNA with inverted terminal repeats, such as Parvoviruses and Bidensoviruses^48,49, we have to consider the possibility that linear ssDNA phages may also be found in nature and that they may be detected and characterizable by PTV. Our strategy is also not restricted to bacteriophages. Thus, viruses that package linear nucleic acids that infect any kingdom, including ssDNA viruses and linear RNA viruses, should also be characterizable by PTV. The detection of the termini of linear RNA viruses requires custom sample preparation protocols which were not explored here. Protocols akin to those used to identify transcription starts in mRNA sequencing could readily be used to identify the 5′ end of RNA viruses⁵⁰, whereas methods similar to those used to identify transcription terminators could be used to identify the 3′ end⁵¹. In addition, emerging methods involving the direct sequencing of RNA fragments in a sample (e.g., Nanopore direct RNA sequencingg)^52,53 are expected to be compatible with PTV and should also allow the efficient detection of RNA phage termini and the characterization of their packaging mechanism.

We hypothesized that PTV should be able detect DNA ends not only in phages and viruses, but also in other biological entities, such as PICIs^54,55 and other transducing agents with conserved DNA ends. For example, a recent study has described the packaging of non-random fragments of bacterial genomes inside phage capsids in more detail⁴⁰. The authors of this study showed that the precise packaging of bacterial DNA is made possible by the presence of sites resembling phage pac sites at multiple locations throughout the host genome. Until recently, gene transfer agents (GTA), which are phage-like particles, were thought to package bacterial DNA in a random fashion only, but recent studies have shown that certain GTAs (i.e., those in the dinoflagellate-associated bacterium Dinoroseobacter shibae) can package non-random DNA fragments, presumably also using a headful packaging mechanism⁵⁶. Like phages, these entities are also of great interest, as they are increasingly alleged to be involved in horizontal gene transfer, host adaptation, and bacterial virulence^57,58. The ability to define the exact start and end of these sequences by identifying the termini should help in more precisely defining and characterizing these particles and their role in various ecosystems.

Going forward, the detection and thorough characterization of viral contigs in metagenomic datasets will benefit from the use of a combination of tools based on various approaches^59,60. In summary, PTV is a new and distinct tool for the high-throughput characterization of phage genomes that should substantially accelerate the comprehensive study of the virosphere.

Materials and methods

Origin of samples, library preparation, sequencing, and contig assembly

Three sequence datasets were used in this study: (i) a simulated mock virome containing reads collected from five reference phages sequenced in our previous study (Lambda, HK97, T7, P1, Mu)²⁷, in which the mock virome was generated by randomly subsampling 2 × 10⁶ paired-end reads from each phage dataset and grouping them in a single forward and reverse read file (1 × 10⁷ paired-end reads in total), (ii) a human gut virome sequencing dataset (SRR4295172) from a previously published study³⁷, and (iii) a human gut virome sequencing dataset generated for this study. For this study, viral particles were separated and purified as follows: 500 mg of fecal matter was resuspended in sodium citrate solution (50 mL) and centrifuged at 300×g for 10 min at 4 °C. The supernatant was recovered and centrifuged at 5000×g for 45 min at 4 °C to pellet the bacteria. The supernatant was again recovered and diluted (1:5) with cold SM buffer (200 mM NaCl, 10 mM MgSO₄, 50 mM Tris pH 7.5) and filtrated using 0.45 μm and 0.2 μm polyethersulfone membranes. For viral particle precipitation, PEG 6000 was added to a final concentration of 10% and the samples maintained at 4 °C overnight. The following morning, samples were centrifuged at 6000×g for 1 h at 4 °C and the pellets resuspended in 2 ml cold SM buffer and treated with an equal volume of chloroform. After centrifugation at 15,000×g for 5 min at 4 °C, the aqueous phase containing viral particles was recovered. SDS (0.5% final concentration) and proteinase K (20 mg/mL) were added and the samples incubated for 3 h at 65 °C. DNA was extracted using phenol chloroform, precipitated with ethanol, and resuspended in 10 mM Tris–HCl. DNA was sonicated using a Covaris S220 apparatus. The sequencing library was prepared using a TruSeq Illumina kit, 2 × 75 bp paired end, and sequenced on a NextSeq550 Illumina sequencer. The run generated 102,581,131 paired-end reads (15.6 Gbases total). The sequence dataset was deposited under SRR11177648.

Read quality control and virome assembly

To generate a proper assembly of all virome datasets, raw reads were trimmed and cleaned using AlienTrimmer v.0.4.0⁶¹ and the NextFlex PCR Free adapter list, with the following options: -l 30. Cleaned reads were then assembled using metaSPAdes v.3.10.0⁶² with the following parameters: --meta -k 21,33,55,77,99,127 --threads 12. Contigs ≥ 10 kbp were retained for further analyses.

Prediction and detection of phage contigs

Assembled contigs were analyzed for viral/phage detection using three different popular user-friendly software programs: MetaPhinder v.2.1³⁰, VirSorter v.1.0.3²⁹, and PHASTER (using multiple separate contigs option)³¹. All software was used with the default parameters.

PhageTermVirome analyses

PTV v.4.0.0 was run in the paired-end mode with the default options for all datasets analyzed in this study.

Phage contig annotation

Selected contigs were annotated with PROKKA v.1.14.0⁶³, run locally, using the PHASTER curated database as the primary source of trusted phage proteins (as of December 22, 2020), with an E-value threshold of 10⁻³.

Protein network analysis with vConTACT2

Network analysis with vConTACT2 v.0.9.22 was performed using three reference contig databases (GVD⁸, ViralRefSeq V.97⁴¹, and a database of curated virus-like particles from the mouse from Duerkop et al.⁴². The visualization of the protein-sharing network was done using Cytoscape software v.3.1.1; http://cytoscape.org/)⁶⁴. The edge-weighted spring-embedded model was used to position genomes sharing most protein clusters in proximity to one another.

Nucleotide sequence accession numbers

The raw read data of the human fecal virome from this study was deposited in the sequence read archive (SRA) under the accession number SRR11177648.

Data availability

PhageTermVirome source code is available at GitHub as a standalone software written in python3 (https://gitlab.pasteur.fr/vlegrand/ptv). A conda virtual environment is distributed with the software and all installation instructions are specified in the included “README.txt” file. Other datasets generated during the current study are available from the corresponding author on reasonable request.

Change history

09 May 2022
A Correction to this paper has been published: https://doi.org/10.1038/s41598-022-12064-0

References

Suttle, C. A. Viruses in the sea. Nature 437, 356–361. https://doi.org/10.1038/nature04160 (2005).
Article ADS CAS PubMed Google Scholar
Suttle, C. A. Marine viruses—Major players in the global ecosystem. Nat. Rev. Microbiol. 5, 801–812. https://doi.org/10.1038/nrmicro1750 (2007).
Article CAS PubMed Google Scholar
Brum, J. R. & Sullivan, M. B. Rising to the challenge: Accelerated pace of discovery transforms marine virology. Nat. Rev. Microbiol. 13, 147–159. https://doi.org/10.1038/nrmicro3404 (2015).
Article CAS PubMed Google Scholar
Fuhrman, J. A. Marine viruses and their biogeochemical and ecological effects. Nature 399, 541–548. https://doi.org/10.1038/21119 (1999).
Article ADS CAS PubMed Google Scholar
Forterre, P. The virocell concept and environmental microbiology. ISME J. 7, 233–236. https://doi.org/10.1038/ismej.2012.110 (2013).
Article CAS PubMed Google Scholar
Rosenwasser, S., Ziv, C., van Creveld, S. G. & Vardi, A. Virocell metabolism: Metabolic innovations during host–virus interactions in the ocean. Trends Microbiol. 24, 821–832. https://doi.org/10.1016/j.tim.2016.06.006 (2016).
Article CAS PubMed Google Scholar
Barr, J. J. et al. Bacteriophage adhering to mucus provide a non-host-derived immunity. Proc. Natl. Acad. Sci. U.S.A. 110, 10771–10776. https://doi.org/10.1073/pnas.1305923110 (2013).
Article ADS PubMed PubMed Central Google Scholar
Gregory, A. C. et al. The gut virome database reveals age-dependent patterns of virome diversity in the human gut. Cell Host Microbe 28, 724-740.e8. https://doi.org/10.1016/j.chom.2020.08.003 (2020).
Article CAS PubMed PubMed Central Google Scholar
Broecker, F., Klumpp, J. & Moelling, K. Long-term microbiota and virome in a Zürich patient after fecal transplantation against Clostridium difficile infection. Ann. N. Y. Acad. Sci. 1372, 29–41. https://doi.org/10.1111/nyas.13100 (2016).
Article ADS PubMed Google Scholar
Ma, Y., You, X., Mai, G., Tokuyasu, T. & Liu, C. A human gut phage catalog correlates the gut phageome with type 2 diabetes. Microbiome 6, 24. https://doi.org/10.1186/s40168-018-0410-y (2018).
Article PubMed PubMed Central Google Scholar
Monaco, C. L. et al. Altered virome and bacterial microbiome in human immunodeficiency virus-associated acquired immunodeficiency syndrome. Cell Host Microbe 19, 311–322. https://doi.org/10.1016/j.chom.2016.02.011 (2016).
Article CAS PubMed PubMed Central Google Scholar
Norman, J. M. et al. Disease-specific alterations in the enteric virome in inflammatory bowel disease. Cell 160, 447–460. https://doi.org/10.1016/j.cell.2015.01.002 (2015).
Article CAS PubMed PubMed Central Google Scholar
Gregory, A. C. et al. Marine DNA viral macro- and microdiversity from pole to pole. Cell 177, 1109-1123.e14. https://doi.org/10.1016/j.cell.2019.03.040 (2019).
Article CAS PubMed PubMed Central Google Scholar
Roux, S. et al. Ecogenomics and potential biogeochemical impacts of globally abundant ocean viruses. Nature 537, 689–693. https://doi.org/10.1038/nature19366 (2016).
Article CAS PubMed Google Scholar
Brum, J. R. et al. Ocean plankton. Patterns and ecological drivers of ocean viral communities. Science 348, 1261498. https://doi.org/10.1126/science.1261498 (2015).
Article CAS PubMed Google Scholar
Paez-Espino, D. et al. Uncovering Earth’s virome. Nature 536, 425–430. https://doi.org/10.1038/nature19094 (2016).
Article ADS CAS PubMed Google Scholar
Emerson, J. B. et al. Host-linked soil viral ecology along a permafrost thaw gradient. Nat. Microbiol. 3, 870–880. https://doi.org/10.1038/s41564-018-0190-y (2018).
Article CAS PubMed PubMed Central Google Scholar
Camarillo-Guerrero, L. F., Almeida, A., Rangel-Pineros, G., Finn, R. D. & Lawley, T. D. Massive expansion of human gut bacteriophage diversity. Cell 184, 1098-1109.e9. https://doi.org/10.1016/j.cell.2021.01.029 (2021).
Article CAS PubMed PubMed Central Google Scholar
Aggarwala, V., Liang, G. & Bushman, F. D. Viral communities of the human gut: Metagenomic analysis of composition and dynamics. Mob. DNA 8, 12. https://doi.org/10.1186/s13100-017-0095-y (2017).
Article CAS PubMed PubMed Central Google Scholar
Youle, M., Haynes, M. & Rohwer, F. Scratching the surface of biology’s dark matter. In Viruses Essent. Agents’s Life (ed. Witzany, G.) 61–81 (Springer Netherlands, 2012).
Chapter Google Scholar
Roux, S., Hallam, S. J., Woyke, T. & Sullivan, M. B. Viral dark matter and virus-host interactions resolved from publicly available microbial genomes. Elife 4, e08490. https://doi.org/10.7554/eLife.08490 (2015).
Article PubMed Central Google Scholar
Bolduc, B. et al. vConTACT: An iVirus tool to classify double-stranded DNA viruses that infect Archaea and Bacteria. PeerJ 5, e3243. https://doi.org/10.7717/peerj.3243 (2017).
Article PubMed PubMed Central Google Scholar
Bin Jang, H. et al. Taxonomic assignment of uncultivated prokaryotic virus genomes is enabled by gene-sharing networks. Nat. Biotechnol. 37, 632–639. https://doi.org/10.1038/s41587-019-0100-8 (2019).
Article CAS PubMed Google Scholar
Amgarten, D., Braga, L. P. P., da Silva, A. M. & Setubal, J. C. MARVEL, a tool for prediction of bacteriophage sequences in metagenomic bins. Front. Genet. 9, 304. https://doi.org/10.3389/fgene.2018.00304 (2018).
Article CAS PubMed PubMed Central Google Scholar
Tampuu, A., Bzhalava, Z., Dillner, J. & Vicente, R. ViraMiner: Deep learning on raw DNA sequences for identifying viral genomes in human samples. PLoS One 14, e0222271. https://doi.org/10.1371/journal.pone.0222271 (2019).
Article CAS PubMed PubMed Central Google Scholar
Li, S. et al. Scrutinizing virus genome termini by high-throughput sequencing. PLoS One 9, e85806. https://doi.org/10.1371/journal.pone.0085806 (2014).
Article ADS CAS PubMed PubMed Central Google Scholar
Garneau, J. R., Depardieu, F., Fortier, L.-C., Bikard, D. & Monot, M. PhageTerm: A tool for fast and accurate determination of phage termini and packaging mechanism using next-generation sequencing data. Sci. Rep. 7, 8292. https://doi.org/10.1038/s41598-017-07910-5 (2017).
Article ADS CAS PubMed PubMed Central Google Scholar
Guo, J. et al. VirSorter2: A multi-classifier, expert-guided approach to detect diverse DNA and RNA viruses. Microbiome 9, 37. https://doi.org/10.1186/s40168-020-00990-y (2021).
Article PubMed PubMed Central Google Scholar
Roux, S., Enault, F., Hurwitz, B. L. & Sullivan, M. B. VirSorter: Mining viral signal from microbial genomic data. PeerJ 3, e985. https://doi.org/10.7717/peerj.985 (2015).
Article CAS PubMed PubMed Central Google Scholar
Jurtz, V. I., Villarroel, J., Lund, O., Voldby Larsen, M. & Nielsen, M. MetaPhinder-identifying bacteriophage sequences in metagenomic data sets. PLoS One 11, e0163111. https://doi.org/10.1371/journal.pone.0163111 (2016).
Article CAS PubMed PubMed Central Google Scholar
Arndt, D. et al. PHASTER: A better, faster version of the PHAST phage search tool. Nucleic Acids Res. 44, W16–W21. https://doi.org/10.1093/nar/gkw387 (2016).
Article CAS PubMed PubMed Central Google Scholar
Ren, J., Ahlgren, N. A., Lu, Y. Y., Fuhrman, J. A. & Sun, F. VirFinder: A novel k-mer based tool for identifying viral sequences from assembled metagenomic data. Microbiome 5, 69. https://doi.org/10.1186/s40168-017-0283-5 (2017).
Article PubMed PubMed Central Google Scholar
Kieft, K., Zhou, Z. & Anantharaman, K. VIBRANT: Automated recovery, annotation and curation of microbial viruses, and evaluation of viral community function from genomic sequences. Microbiome 8, 90. https://doi.org/10.1186/s40168-020-00867-0 (2020).
Article CAS PubMed PubMed Central Google Scholar
Ajami, N. J., Wong, M. C., Ross, M. C., Lloyd, R. E. & Petrosino, J. F. Maximal viral information recovery from sequence data using VirMAP. Nat. Commun. 9, 3205. https://doi.org/10.1038/s41467-018-05658-8 (2018).
Article ADS CAS PubMed PubMed Central Google Scholar
Tithi, S. S., Aylward, F. O., Jensen, R. V. & Zhang, L. FastViromeExplorer: A pipeline for virus and phage identification and abundance profiling in metagenomics data. PeerJ 6, e4227. https://doi.org/10.7717/peerj.4227 (2018).
Article CAS PubMed PubMed Central Google Scholar
Auslander, N., Gussow, A. B., Benler, S., Wolf, Y. I. & Koonin, E. V. Seeker: Alignment-free identification of bacteriophage genomes by deep learning. Nucleic Acids Res. 48, e121. https://doi.org/10.1093/nar/gkaa856 (2020).
Article CAS PubMed PubMed Central Google Scholar
Manrique, P. et al. Healthy human gut phageome. Proc. Natl. Acad. Sci. U.S.A. 113, 10400–10405. https://doi.org/10.1073/pnas.1601060113 (2016).
Article ADS CAS PubMed PubMed Central Google Scholar
Sutton, T. D. S., Clooney, A. G., Ryan, F. J., Ross, R. P. & Hill, C. Choice of assembly software has a critical impact on virome characterisation. Microbiome 7, 12. https://doi.org/10.1186/s40168-019-0626-5 (2019).
Article PubMed PubMed Central Google Scholar
Somerville, V. et al. Long-read based de novo assembly of low-complexity metagenome samples results in finished genomes and reveals insights into strain diversity and an active phage system. BMC Microbiol. 19, 143. https://doi.org/10.1186/s12866-019-1500-0 (2019).
Article CAS PubMed PubMed Central Google Scholar
Kleiner, M., Bushnell, B., Sanderson, K. E., Hooper, L. V. & Duerkop, B. A. Transductomics: Sequencing-based detection and analysis of transduced DNA in pure cultures and microbial communities. Microbiome 8, 158. https://doi.org/10.1186/s40168-020-00935-5 (2020).
Article CAS PubMed PubMed Central Google Scholar
Brister, J. R., Ako-Adjei, D., Bao, Y. & Blinkova, O. NCBI viral genomes resource. Nucleic Acids Res. 43, D571–D577. https://doi.org/10.1093/nar/gku1207 (2015).
Article CAS PubMed Google Scholar
Duerkop, B. A. et al. Murine colitis reveals a disease-associated bacteriophage community. Nat. Microbiol. 3, 1023–1031. https://doi.org/10.1038/s41564-018-0210-y (2018).
Article CAS PubMed PubMed Central Google Scholar
Zablocki, O. et al. VirION2: A short- and long-read sequencing and informatics workflow to study the genomic diversity of viruses in nature. Microbiology https://doi.org/10.1101/2020.10.28.359364 (2020).
Article Google Scholar
Bárdy, P. et al. Structure and mechanism of DNA delivery of a gene transfer agent. Nat. Commun. 11, 3034. https://doi.org/10.1038/s41467-020-16669-9 (2020).
Article ADS CAS PubMed PubMed Central Google Scholar
Roux, S. et al. Cryptic inoviruses revealed as pervasive in bacteria and archaea across Earth’s biomes. Nat. Microbiol. 4, 1895–1906. https://doi.org/10.1038/s41564-019-0510-x (2019).
Article CAS PubMed PubMed Central Google Scholar
Tisza, M. J. et al. Discovery of several thousand highly diverse circular DNA viruses. Elife 9, e51971. https://doi.org/10.7554/eLife.51971 (2020).
Article CAS PubMed PubMed Central Google Scholar
Creasy, A., Rosario, K., Leigh, B. A., Dishaw, L. J. & Breitbart, M. Unprecedented diversity of ssDNA phages from the family microviridae detected within the gut of a protochordate model organism (Ciona robusta). Viruses https://doi.org/10.3390/v10080404 (2018).
Article PubMed PubMed Central Google Scholar
Krupovic, M. & Forterre, P. Single-stranded DNA viruses employ a variety of mechanisms for integration into host genomes. Ann. N. Y. Acad. Sci. 1341, 41–53. https://doi.org/10.1111/nyas.12675 (2015).
Article ADS CAS PubMed Google Scholar
Krupovic, M. & Koonin, E. V. Evolution of eukaryotic single-stranded DNA viruses of the Bidnaviridae family from genes of four other groups of widely different viruses. Sci. Rep. 4, 5347. https://doi.org/10.1038/srep05347 (2014).
Article ADS CAS PubMed PubMed Central Google Scholar
Stamatoyannopoulos, J. A. Illuminating eukaryotic transcription start sites. Nat. Methods 7, 501–503. https://doi.org/10.1038/nmeth0710-501 (2010).
Article CAS PubMed Google Scholar
Dar, D. et al. Term-seq reveals abundant ribo-regulation of antibiotics resistance in bacteria. Science 352, aad9822. https://doi.org/10.1126/science.aad9822 (2016).
Article CAS PubMed PubMed Central Google Scholar
Garalde, D. R. et al. Highly parallel direct RNA sequencing on an array of nanopores. Nat. Methods 15, 201–206. https://doi.org/10.1038/nmeth.4577 (2018).
Article CAS PubMed Google Scholar
Lorenz, D. A., Sathe, S., Einstein, J. M. & Yeo, G. W. Direct RNA sequencing enables m6A detection in endogenous transcript isoforms at base-specific resolution. RNA 26, 19–28. https://doi.org/10.1261/rna.072785.119 (2020).
Article CAS PubMed PubMed Central Google Scholar
Martínez-Rubio, R. et al. Phage-inducible islands in the Gram-positive cocci. ISME J. 11, 1029–1042. https://doi.org/10.1038/ismej.2016.163 (2017).
Article CAS PubMed Google Scholar
Penadés, J. R. & Christie, G. E. The phage-inducible chromosomal islands: A family of highly evolved molecular parasites. Annu. Rev. Virol. 2, 181–201. https://doi.org/10.1146/annurev-virology-031413-085446 (2015).
Article CAS PubMed Google Scholar
Tomasch, J. et al. Packaging of Dinoroseobacter shibae DNA into gene transfer agent particles is not random. Genome Biol. Evol. 10, 359–369. https://doi.org/10.1093/gbe/evy005 (2018).
Article CAS PubMed PubMed Central Google Scholar
Soucy, S. M., Huang, J. & Gogarten, J. P. Horizontal gene transfer: Building the web of life. Nat. Rev. Genet. 16, 472–482. https://doi.org/10.1038/nrg3962 (2015).
Article CAS PubMed Google Scholar
Fogg, P. C. M. Identification and characterization of a direct activator of a gene transfer agent. Nat. Commun. 10, 595. https://doi.org/10.1038/s41467-019-08526-1 (2019).
Article ADS CAS PubMed PubMed Central Google Scholar
Bolduc, B., Youens-Clark, K., Roux, S., Hurwitz, B. L. & Sullivan, M. B. iVirus: Facilitating new insights in viral ecology with software and community data sets imbedded in a cyberinfrastructure. ISME J. 11, 7–14. https://doi.org/10.1038/ismej.2016.89 (2017).
Article PubMed Google Scholar
Paez-Espino, D. et al. IMG/VR v.2.0: An integrated data management and analysis system for cultivated and environmental viral genomes. Nucleic Acids Res. 47, D678–D686. https://doi.org/10.1093/nar/gky1127 (2019).
Article CAS PubMed Google Scholar
Criscuolo, A. & Brisse, S. AlienTrimmer removes adapter oligonucleotides with high sensitivity in short-insert paired-end reads. Commentary on Turner (2014) Assessment of insert sizes and adapter content in FASTQ data from NexteraXT libraries. Front. Genet. 5, 130. https://doi.org/10.3389/fgene.2014.00130 (2014).
Article CAS PubMed PubMed Central Google Scholar
Nurk, S., Meleshko, D., Korobeynikov, A. & Pevzner, P. A. metaSPAdes: A new versatile metagenomic assembler. Genome Res. 27, 824–834. https://doi.org/10.1101/gr.213959.116 (2017).
Article CAS PubMed PubMed Central Google Scholar
Seemann, T. Prokka: Rapid prokaryotic genome annotation. Bioinformatics (Oxf., Engl.) 30, 2068–2069. https://doi.org/10.1093/bioinformatics/btu153 (2014).
Article CAS Google Scholar
Kohl, M., Wiese, S. & Warscheid, B. Cytoscape: Software for visualization and analysis of biological networks. In Data Min. Proteomics Stand. Appl. (eds Hamacher, M. et al.) 291–303 (Humana Press, 2011).
Chapter Google Scholar

Download references

Funding

This work was supported by the French Government’s program Investissement d’Avenir program; Laboratoire d’Excellence ‘Integrative Biology of Emerging Infectious Diseases’ [ANR-10-LABX-62-IBEID] and the Natural Sciences and Engineering Research Council of Canada (NSERC #341450-2010). Biomics Platform, C2RT, Institut Pasteur, Paris, France, is supported by France Génomique (ANR-10-INBS-09-09) and the Infrastructures de recherche en biologie, santé et agronomie (IBISA). M.M. and J.G. were supported by the “CDPhages” ANR JCJC grant from ANR-18-CE35-0011.

Author information

These authors contributed equally: Julian R. Garneau and Véronique Legrand.

Authors and Affiliations

Biomics Platform, C2RT, Institut Pasteur, 75015, Paris, France
Julian R. Garneau & Marc Monot
Infrastructure et Ingénierie Scientifique, Institut Pasteur, 75015, Paris, France
Véronique Legrand
Institut Pasteur, Unité Régulation Spatiale des Génomes, UMR 3525, CNRS, 75015, Paris, France
Martial Marbouty
Phase Genomics Inc, Seattle, WA, 98109, USA
Maximilian O. Press
Department of Microbiology, Ohio State University, Columbus, OH, 43210, USA
Dean R. Vik & Matthew B. Sullivan
Faculty of Medicine and Health Sciences, Department of Microbiology and Infectious Diseases, Université de Sherbrooke, Sherbrooke, QC, J1E 4K8, Canada
Louis-Charles Fortier
Département de Microbiologie, Institut Pasteur, Groupe Biologie de Synthèse, 75015, Paris, France
David Bikard

Authors

Julian R. Garneau
View author publications
You can also search for this author in PubMed Google Scholar
Véronique Legrand
View author publications
You can also search for this author in PubMed Google Scholar
Martial Marbouty
View author publications
You can also search for this author in PubMed Google Scholar
Maximilian O. Press
View author publications
You can also search for this author in PubMed Google Scholar
Dean R. Vik
View author publications
You can also search for this author in PubMed Google Scholar
Louis-Charles Fortier
View author publications
You can also search for this author in PubMed Google Scholar
Matthew B. Sullivan
View author publications
You can also search for this author in PubMed Google Scholar
David Bikard
View author publications
You can also search for this author in PubMed Google Scholar
Marc Monot
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

J.R.G., M.Mo. and D.B. designed the study. J.R.G., V.L., M.Mo., and D.B. wrote the PhageTermVirome code. M.Ma. prepared the data from the human virome. J.R.G., M.Mo., V.L., D.R.V. and M.O.P. performed the bioinformatics analyses. J.R.G., M.Mo., D.B. and M.B.S. wrote the manuscript. J.R.G., V.L., M.Mo., D.B., M.O.P., D.R.V., L.-C.F., M.Ma., M.B.S. contributed to the manuscript and have read and accepted the final version.

Corresponding authors

Correspondence to Julian R. Garneau or Marc Monot.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

The original online version of this Article was revised: The Funding section in the original version of this Article was incomplete. The Funding section now reads: “This work was supported by the French Government’s program Investissement d’Avenir program; Laboratoire d’Excellence ‘Integrative Biology of Emerging Infectious Diseases’ [ANR-10-LABX-62-IBEID] and the Natural Sciences and Engineering Research Council of Canada (NSERC #341450-2010). Biomics Platform, C2RT, Institut Pasteur, Paris, France, is supported by France Génomique (ANR-10-INBS-09-09) and the Infrastructures de recherche en biologie, santé et agronomie (IBISA). M.M. and J.G. were supported by the “CDPhages” ANR JCJC grant from ANR-18-CE35-0011.”

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Garneau, J.R., Legrand, V., Marbouty, M. et al. High-throughput identification of viral termini and packaging mechanisms in virome datasets using PhageTermVirome. Sci Rep 11, 18319 (2021). https://doi.org/10.1038/s41598-021-97867-3

Download citation

Received: 31 July 2021
Accepted: 27 August 2021
Published: 15 September 2021
DOI: https://doi.org/10.1038/s41598-021-97867-3
Springer Nature Limited

This article is cited by

Chromosome folding and prophage activation reveal specific genomic architecture for intestinal bacteria
- Quentin Lamy-Besnier
- Amaury Bignaud
- Martial Marbouty
Microbiome (2023)
Guidelines for public database submission of uncultivated virus genome sequences for taxonomic classification
- Evelien M. Adriaenssens
- Simon Roux
- Bas E. Dutilh
Nature Biotechnology (2023)

High-throughput identification of viral termini and packaging mechanisms in virome datasets using PhageTermVirome

Abstract

Similar content being viewed by others

Gauge your phage: benchmarking of bacteriophage identification tools in metagenomic sequencing data

viralFlye: assembling viruses and identifying their hosts from long-read metagenomics data

Discovery of an expansive bacteriophage family that includes the most abundant viruses from the human gut

Introduction