Background

Whole genome sequencing (WGS) of bacterial isolates using next-generation sequencing technologies has progressively become the predominant method in clinical microbiology, public health surveillance, and disease control [1,2,3]. The ability to study the complete genetic information of large numbers of bacterial genomes provides the potential to generate insights into pathogen genotype/phenotype relationships [4,5,6], the transmissibility of virulent pathogens [7, 8], and the tracking of antibiotic resistance [9, 10]. The combination of genomic information and epidemiological data has been frequently used in disease control, for example in rapid outbreak cluster investigation during the recent SARS-CoV-2 pandemic [11, 12] and in the inference and prediction of evolutionary trajectories of pathogen diversification [13, 14]. The richness of current high-throughput genomic data has created a solid foundation for systematic studies of large cohorts of related genomes using genome-wide methods such as cgMLST, phylogenetic, or pangenomic analyses. WGS approaches can generate insightful data to characterize known pathogens and assist in unraveling the characteristics of unknown ones [15, 16], which is critical in understanding and thus controlling disease outbreaks.

To meet the demand for analysis tools, a number of computational pipelines have been developed to facilitate the analysis of microbial WGS data and to generate practical results of interest. Several have become well established and widely used in the field, notably Nullarbor [17], Bactopia [18], and ASA3P [19]. The first, Nullarbor, has long been part of standard public health microbial genomics procedures, while the latter two are more recent and offer comprehensive, wide-spectrum functionality. However, these software pipelines usually require high-end computational infrastructure and take prohibitively long to run when collection sizes grow beyond thousands of genomes. Furthermore, while it is typical for laboratories to collect and sequence new samples over time, none of the existing pipelines can efficiently manage growing collections to which new samples are constantly added. In most cases, many parts of these pipelines need to be rerun every time new samples are added to the collection, incurring substantial additional computation costs.

Here we introduce AMRomics, a lightweight open-source software package for analyzing and managing large collections of bacterial genomes. The tool generates essential genomic results for individual samples, together with a population-level analysis that is substantially faster and more resource-efficient than existing pipelines. Thanks to its efficient design, AMRomics makes the analysis of large bacterial collections feasible on regular desktop computers with reasonable turn-around times. The AMRomics source code is available at https://github.com/amromics/amromics.git.

Workflow and implementation

AMRomics is a software package that provides a comprehensive suite of genomic analyses of microbial collections in a simple and easy-to-use manner. It is designed to be performant and scalable to large genome collections with minimal hardware requirements, without compromising the analysis results. To that end, we selected tools considered best practice in microbial genomics and stitched them together in a well-structured workflow, as described in the next section. For certain tasks in the workflow, AMRomics lets users select among several alternative tools. The workflow is written in Python and is designed as a modular and expandable application, with standardized data formats flowing between the tools in the workflow.

The software flexibly accepts input data in various formats, including sequencing reads (from Illumina, PacBio, and Nanopore technologies), genome assemblies, and genome annotations. It then performs assembly, genome annotation, MLST, virulome and resistome prediction, pangenome clustering, phylogenetic tree construction for each gene and for the core genes, and pan-SNP analysis, all with a simple command line. AMRomics achieves this by building a pipeline consisting of current best practice tools in bacterial genomics. It is also designed to be fast, efficient, and scalable to collections of thousands of isolates on a computer with modest hardware. Crucially, AMRomics supports the progressive analysis of a growing collection, where new samples can be added to an existing collection without the need to rebuild the collection from scratch.

Functionally, the AMRomics pipeline can be split into two stages: single sample analysis and collection analysis, as depicted in Fig. 1. In the single sample analysis stage, every sample is processed according to the type of its input data. Specifically, for Illumina sequencing data, fastp [20, 21] is employed for quality control, adapter trimming, quality filtering, and read pruning. The pre-processed reads are then assembled into a draft genome. SKESA [22] is the method of choice for assembling Illumina sequencing data because of its speed, but users can optionally choose SPAdes [23, 24] for slightly better N50 at the cost of extra computation time. If long read data (Nanopore or PacBio) are provided, the sample genome is assembled with Flye [25]. The assembly step can be skipped if the user provides a genome assembly in FASTA format as input to the pipeline. AMRomics then standardizes the sample IDs and contig names to ensure their uniqueness in the collection. Next, the genome assembly is annotated with Prokka [26] unless annotations are provided by the user. The gene sequences are extracted and stored in files at predefined locations. The genome sequence is also subject to multi-locus sequence typing with the PubMLST database of typing schemes for bacterial strains [27], antibiotic resistance gene identification with the AMRFinderPlus database [28], virulence gene identification with the virulence factor database VFDB [29, 30], and plasmid detection with the PlasmidFinder database [31]. All results of the single sample analysis are organized in a standard manner.
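The following Python sketch illustrates how the single sample stage can be dispatched by input type. It is a simplified illustration rather than the actual AMRomics implementation; the function name, data layout, and tool command lines (fastp, SKESA, Flye, Prokka) are indicative only and should be checked against each tool's documentation.

```python
# Simplified illustration of dispatching single sample analysis by input type.
# This is NOT the actual AMRomics code; command lines are indicative only.
import shutil
import subprocess
from pathlib import Path

def analyse_sample(sample_id, input_type, files, outdir, threads=4):
    outdir = Path(outdir) / sample_id
    outdir.mkdir(parents=True, exist_ok=True)
    assembly = outdir / f"{sample_id}.fasta"

    if input_type == "illumina":                 # paired-end short reads
        r1, r2 = files
        t1, t2 = outdir / "trim_1.fq.gz", outdir / "trim_2.fq.gz"
        subprocess.run(["fastp", "-i", r1, "-I", r2,
                        "-o", str(t1), "-O", str(t2)], check=True)
        with open(assembly, "w") as fh:          # SKESA writes contigs to stdout
            subprocess.run(["skesa", "--reads", f"{t1},{t2}",
                            "--cores", str(threads)], stdout=fh, check=True)
    elif input_type in ("nanopore", "pacbio"):   # long reads are assembled with Flye
        read_flag = "--nano-raw" if input_type == "nanopore" else "--pacbio-raw"
        subprocess.run(["flye", read_flag, files[0], "--out-dir",
                        str(outdir / "flye"), "--threads", str(threads)], check=True)
        shutil.copyfile(outdir / "flye" / "assembly.fasta", assembly)
    else:                                        # assembly provided: skip the assembly step
        shutil.copyfile(files[0], assembly)      # real pipeline also standardizes contig names

    # Annotate the assembly with Prokka (skipped when the user supplies annotations)
    subprocess.run(["prokka", "--outdir", str(outdir / "prokka"),
                    "--prefix", sample_id, "--cpus", str(threads),
                    str(assembly)], check=True)
    return assembly
```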

Fig. 1

AMRomics workflow. A The workflow framework includes modules for single sample analysis and for collection analysis. B Details of the single sample analysis module. The genome assembly sequence of the sample is generated by genome assemblers if the input data are sequencing reads (FASTQ), or by standardization of the input genome sequence. The genome assembly is then subject to MLST, genome annotation, resistome and virulome detection, and plasmid detection. C Details of the collection analysis module. The annotation (GFF) files from the single sample analysis are the input of the pangenome analysis step. The gene cluster information from the pangenome step is used for creating MSAs, per-gene phylogenies, and the species phylogeny for the samples in the collection

In the second stage, AMRomics performs a pangenome comparative analysis of the genome collection. The annotations of all genomes, in GFF format, are loaded into a pangenome inference module for gene clustering. PanTA [32] is the method of choice for pangenome construction because of its speed and scalability, but users can optionally choose Roary [33] as an alternative. AMRomics then classifies gene clusters into core genes (gene clusters present in at least 95% of the genomes in the collection) and accessory genes. In addition, AMRomics identifies shell genes, which are those present in at least a certain fraction of the genomes in the pangenome. The threshold for shell genes defaults to 25% but can be adjusted by users. AMRomics then performs multiple sequence alignment (MSA) of all identified shell genes using MAFFT [34]. The MSAs of these shell genes are used to construct phylogenetic trees of the corresponding gene families with FastTree 2 [35] or IQ-TREE 2 [36]. In addition, AMRomics builds the phylogeny of the collection from the concatenation of the MSAs of all core genes using the chosen tree-building method.
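As an illustration, the classification of gene clusters by prevalence can be sketched as follows. This is a minimal example under the thresholds described above; the function name and data structures are hypothetical and not AMRomics's internal API.

```python
# Minimal sketch (assumed logic, not AMRomics's exact code) of classifying gene
# clusters by their prevalence across a collection of genomes.
def classify_clusters(clusters, n_genomes, core_frac=0.95, shell_frac=0.25):
    """clusters: dict mapping cluster_id -> list of (genome_id, gene_id) members."""
    core, shell, accessory = [], [], []
    for cluster_id, members in clusters.items():
        prevalence = len({genome for genome, _ in members}) / n_genomes
        if prevalence >= core_frac:
            core.append(cluster_id)       # core genes: present in >= 95% of genomes
        if prevalence >= shell_frac:
            shell.append(cluster_id)      # shell genes: >= 25% by default, aligned with MAFFT
        if prevalence < core_frac:
            accessory.append(cluster_id)  # everything below the core threshold
    return core, shell, accessory
```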

AMRomics introduces pan-SNPs, a novel concept to represent genetic variants of the samples in a collection. Existing variant analysis methods usually rely on a reference genome and can only identify variants in the genes present in that reference genome. This severely limits the analysis to only a fraction of the genome of interest because of the high variability between isolates within a clade. In addition, it is often not possible to have a reference genome that can represent the whole collection, especially if the collection is diverse and growing. AMRomics addresses this by building a pan-reference genome for the collection, which is the set of representative genes from each gene cluster. It then identifies the variants of all genes in a cluster against the representative gene directly from the MSA. The variant profile of a sample is the concatenation of the variants of all its genes, reported in a VCF file.
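Conceptually, calling the variants of a gene against its cluster representative directly from the MSA amounts to a column-by-column comparison of the two aligned sequences, as in the simplified sketch below. This is an assumption-based illustration, not the published implementation; in particular, how indels are normalized in the real VCF output may differ.

```python
# Sketch of deriving per-gene variants from a cluster MSA against the
# representative gene (illustrative only, not the AMRomics implementation).
def variants_from_msa(aligned_rep, aligned_gene):
    """Both inputs are gap-containing aligned sequences of equal length.
    Returns (position, ref, alt) tuples with 1-based positions on the ungapped
    representative gene; indels are reported per alignment column for brevity."""
    assert len(aligned_rep) == len(aligned_gene)
    variants, ref_pos = [], 0
    for ref_base, alt_base in zip(aligned_rep.upper(), aligned_gene.upper()):
        if ref_base != "-":
            ref_pos += 1                  # advance along the representative gene
        if ref_base != alt_base:
            variants.append((ref_pos, ref_base, alt_base))
    return variants

# Example: one gene compared against its cluster representative
print(variants_from_msa("ATG-CGT", "ATGACAT"))  # -> [(3, '-', 'A'), (5, 'G', 'A')]
```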

The representative gene for a gene cluster is set to be the one that comes from the earliest genome in the collection list. With this selection strategy, if users have a preferred reference genome, they can place it first in the collection list so that genes from the reference genome become the representatives of their respective clusters. Moreover, as AMRomics supports adding new samples into a collection, the selection strategy also ensures that the representative gene for a cluster does not change as new samples are added, and that a new representative gene is added to the pan-reference genome only when a new cluster is created as a result of the collection expansion.
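A minimal sketch of this selection strategy, assuming the pan-reference is persisted between runs, is shown below; the function name and data structures are hypothetical and serve only to illustrate how representatives stay stable as the collection grows.

```python
# Sketch (assumed behaviour based on the description above, not the actual code)
# of keeping representative genes stable while a collection grows.
def update_pan_reference(pan_reference, clusters, genome_order):
    """pan_reference: dict cluster_id -> representative gene_id (persisted between runs).
    clusters: dict cluster_id -> list of (genome_id, gene_id) members.
    genome_order: genomes in the order they were added to the collection."""
    rank = {genome: i for i, genome in enumerate(genome_order)}
    for cluster_id, members in clusters.items():
        if cluster_id in pan_reference:
            continue                      # existing clusters keep their representative
        # a new cluster gets the gene from the earliest genome in the collection list
        _, gene_id = min(members, key=lambda m: rank[m[0]])
        pan_reference[cluster_id] = gene_id
    return pan_reference
```

Because only newly created clusters receive a representative, placing a preferred reference genome first in the collection list makes its genes the representatives of all clusters it participates in, as described above.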

All results obtained from running AMRomics are ultimately aggregated into the final output for reporting or customized visualization by end users. Details of the third-party bioinformatics tools and databases used by AMRomics are listed in Supplementary Tables 1 and 2.

Results

Comparison with other pipelines

To the best of our knowledge, at the time of writing there are four existing open-source software pipelines for end-to-end microbial genomics analysis, namely Nullarbor [17], TORMES [37], ASA3P [19] and Bactopia [18]. While AMRomics and these tools share the overall functionality, they differ in their underlying philosophies. Here, we present a high-level discussion of AMRomics features and highlight the principles behind its design.

Overall, AMRomics and the existing tools support a wide variety of input formats, with the exception of Nullarbor and TORMES, which are designed to run on Illumina paired-end reads only, in line with their specific public health routines. AMRomics and the more recent methods, ASA3P [19] and Bactopia, accept raw reads from third-generation sequencing technologies such as Oxford Nanopore or PacBio long reads. A range of genomic analyses is included in all pipelines, covering common tasks in bacterial genomics such as sequence typing (MLST), AMR/virulence factor scanning, and genome annotation for an isolate. While all of the tools provide SNP analysis results, AMRomics reports variants (in VCF files) from the core gene alignment of the pangenome analysis, rather than from the snippy [38] core alignment used by the other methods. Table 1 summarizes the key features across the software tools.

Table 1 Functional comparison between AMRomics and other bacterial genomics pipelines

The primary principle of AMRomics is to extract the highest quality and most informative statistics from the input data. For example, AMRomics constructs the phylogenetic tree of the collection using the multiple alignment of core genes. This provides higher-resolution evolutionary information than SNP information or the multiple alignment of 16S genes [39], the two techniques applied by the existing tools. In addition, AMRomics utilizes the population information to call variants across the pangenome rather than against a chosen reference genome, and hence provides a bigger picture of the genetic relationships among the isolates in the collection. Users can still use one or more preferred reference genomes by placing them at the top of the collection list.

AMRomics’s second and perhaps equally important design principle emphasizes the scalability of the software, aiming to enable the analysis of large collections without the need to scale up hardware infrastructure. While AMRomics uses the same underlying core tools (e.g., BLAST+, SPAdes, SKESA, Flye, Prokka) as other pipelines, we chose to reimplement the helper and pre-processing modules such as Shovill and Dragonflye. In the process, we paid particular attention to the data structures used to manage the large amounts of data flowing between the steps of the pipeline. As a result, AMRomics is significantly faster and requires only a fraction of the memory of its counterparts (shown in the following section). While speed is paramount, AMRomics offers users the flexibility to choose between alternatives that fit their needs when more than one core algorithm is available for the same step (such as SPAdes and SKESA for assembling short reads, or FastTree and IQ-TREE for phylogenetic tree construction). AMRomics also takes advantage of progressive analysis; when new samples are added to an existing collection, AMRomics only performs the extra computation related to the new samples, instead of recomputing from scratch. This strategy offers a scalable solution practically suited to the analysis of large, growing collections of bacteria in the sequencing age.

Case study

We demonstrate the utility of AMRomics on a large and heterogeneous set of Klebsiella pneumoniae genomes collected from various public sources. In particular, we designed a case study that reflects a practical use case and highlights the ease of use, flexibility, and scalability of AMRomics. The input data for the case study consisted of three batches of genome data. The first batch contained the sequencing data of 89 K. pneumoniae isolates collected at Patan Hospital in Kathmandu, Nepal, between May and December 2012 [40]. These were multi-drug resistant isolates, provided as Illumina paired-end short read data. While AMRomics does not require a reference genome for variant calling, we included in this batch four genome assemblies obtained from RefSeq (two in genome assembly FASTA format and two in annotated GFF format) for the other workflows to use as references. In the second batch, we included 11 samples collected from Hospital Universitario Ramon y Cajal, exhibiting carbapenem resistance and harboring the pOXA-48 plasmid [41]. The input data for these 11 samples were Oxford Nanopore sequencing data. Finally, we included a third batch of 1000 samples; the genomes in this batch had previously been assembled and annotated by NCBI PGAP and were provided in GFF format. The data for the case study are provided in the Supporting data.

Despite the commonalities among the analysis pipelines, a direct comparison is challenging due to variations in the processing steps and in the selection of analysis tools within each pipeline. For simplicity, we ran all existing pipelines with default settings covering the essential analyses shown in Table 1. We also made our best effort to use parameters that were most compatible with AMRomics (Supporting data). We did not include TORMES in the comparison because of its resemblance to its predecessor, Nullarbor. The experiments were conducted on a cloud server with moderate performance, equipped with a 6-core, 12-thread E-2286G processor, 32 GB of RAM, and a 960 GB SSD drive.

Table 2 shows the running time and resource consumption of the four pipelines. For the first batch, AMRomics took only 4.32 hours to perform single sample analysis on 89 samples, significantly faster than Bactopia and Nullarbor at 8.82 hours and 11.09 hours respectively, even though the three pipelines use similar underlying algorithms (SKESA for Illumina read assembly, Prokka for annotation, and BLAST for virulome and resistome calling). This is likely due to the better process management and parallelization implemented in the AMRomics software. ASA3P took much longer, 22.24 hours, as a result of using the slower assembler SPAdes, which typically produces assemblies with higher N50. Of note, AMRomics, Bactopia and Nullarbor could optionally use SPAdes as the short read assembler. It is also worth noting that variant calling was part of the single sample analysis in Bactopia, ASA3P and Nullarbor, which also contributed to their extended single sample analysis times. AMRomics took under 1 hour for the collection analyses, including pangenome inference, multiple alignment of shell genes, phylogenetic analyses of the organisms and of every shell gene, and SNP analysis. Nullarbor performed its collection analysis in a much shorter time, 0.19 hours, albeit producing only the pan-genome and the core-gene phylogeny. Bactopia and ASA3P took significantly longer, 2.32 hours and 12.24 hours respectively. Taken together, AMRomics required less than half the time of the other tools for the whole pipeline. It also consumed only 3.44 GB of memory, compared with 5.83 GB by Bactopia, 20.86 GB by ASA3P, and 7.91 GB by Nullarbor.

Table 2 Running times and memory usages of AMRomics, Bactopia, ASA3P and Nullarbor in the case study

The second batch consisted of Nanopore sequencing data for 11 samples, which is not supported by Nullarbor. ASA3P does not support progressive analysis, hence all samples in the first and second batches had to be analyzed from scratch, leading to a total of 54.76 hours. Bactopia took 1.86 hours for single sample analysis, shorter than the 2.74 hours taken by AMRomics, even though both tools used Flye as the underlying assembler. Upon examining the runtimes, we noticed that Bactopia subsampled the sequencing reads to 50x coverage, resulting in the speed-up. AMRomics took less than one hour for the collection analysis thanks to the progressive mode of its underlying pangenome method, PanTA. On the other hand, Bactopia took 3.99 hours.

The genomes in the third batch were already annotated, in GFF format. We did not run ASA3P on the third batch because of the excessive time required to re-analyze the samples from the previous batches. Bactopia does not have a function to extract annotations from GFF files and instead re-annotated the input genomes. In addition, Bactopia simulated sequencing reads from the assembled genomes and mapped the simulated reads back to the reference to call SNPs. These steps, while able to produce the intended analysis results, took 67.72 hours to analyze the 1000 genomes. In contrast, AMRomics reused the existing annotations from the input genomes, leading to a substantially shorter single sample analysis time of only 4.17 hours. Similarly, the pangenome analysis strategy employed by AMRomics reused the existing pangenome computation, requiring only 2.44 hours to add 1000 genomes into the existing pangenome. Bactopia ran its pangenome analysis for more than 20 hours before crashing after running out of memory.

Discussion

We introduce AMRomics, a lightweight and scalable computational pipeline to analyze bacterial genomes and pan-genomes cost-effectively. Our main focus is on optimizing the workflow and the selected sub-modules for microbial genomic studies, especially comparative genomics, and most importantly, on supporting progressive analysis of large, growing data collections. AMRomics provides flexible input scenarios by supporting a wide range of data formats, such as different types of raw reads, assemblies, or annotated genomes for each sample, depending on data availability or the pipeline settings chosen by end users. It generates fundamental genomic properties sample by sample through routine analyses of bacterial isolates, and performs comparative genomics for the whole collection, i.e., pangenome evaluation and the corresponding phylogenetic analyses. Analysis results from AMRomics can be directly imported into AMRViz [42], a visualization tool for viewing and visually inspecting the analysis results.

AMRomics leverages the wealth of available bioinformatics and genomics tools to provide an end-to-end analysis workflow. While it focuses on efficiency and scalability, opportunities exist for enhancing and broadening its functionality. We are continuously updating the pipeline with new methods, either to provide alternative options for each available function or to add novel ones, to meet the various needs of end users. For instance, the default genome annotation module in the community has been Prokka [26], but more recent tools such as Bakta [43] and PGAP [44] are becoming prominent; such tools will be incorporated into the pipeline to provide alternatives tailored to users' needs. Another direction for enhancing the applicability of AMRomics is to include species-specific downstream analyses alongside the core general-purpose modules. This extra practice is required in many microbial genomics surveillance scenarios, especially in public health settings. Examples of such tools include organism-specific serotyping methods: SISTR [45] for Salmonella, ShigaTyper [46] for Shigella and PneumoCaT [47] for Streptococcus pneumoniae.

In summary, AMRomics is a practical tool that can manage and scale up the analysis of large bacterial genome collections with modest computational resources. Continued support for new modules and ongoing workflow maintenance will make it a valuable option in the booming era of microbial genomics data.