Background

RNA-Seq is a technology for the study of the transcriptome based on next-generation sequencing (NGS). Developed since the 2000s, it quickly became one of the methods of choice in the study of differential expression in various fields. One of these is the study of tumors and among them the leukemias, including acute myeloid leukemias (AML), where the RNA-Seq is used with increasing frequency either to characterize the disease or for diagnostic and risk assessment prognosis [1]. For this reason, it becomes important to use with efficiency and simplicity the tools that allow to operate standard analysis, from differential analysis to Pathway and Gene Ontology analyses. An automated pipeline would be useful for this purpose and would save time for analysis.

There are already many pipelines for the analysis of RNA-Seq data, but often they do not include final tertiary-analysis or quality control or there is no possibility to explore the results as a whole through an interactive report. Evaluating the 29 pipelines cited on Wikipedia [2] we noticed that none met all main requirements that could be of interest for an RNA-Seq analysis (see Additional file 1). For this reason, we decided to develop our own pipeline that was as complete as possible in dealing with each step.

Given the number of tools available, the first step was to understand which ones to introduce into the pipeline. In fact, having the possibility to choose between different pipelines can be useful, but at the same time having too many choices can be counterproductive, also because it requires a higher number of prerequisites to install and consequently also a greater storage space. We have therefore selected eight of the most used pipelines in the RNA-Seq analysis and we have chosen six based on the results of sensitivity and specificity obtained on simulation data, as well as for peaks of RAM reached and time taken. The aim is not to evaluate which are the best tools in the different phases of the analysis, as other papers have already dealt with these aspects [3,4,5,6,7] and for this reason we have selected pipelines already published, but simply choose a small number of pipelines to include in the tool taking into consideration generic and computational aspects.

Implementation

Before using ARPIR, it is necessary to install a series of prerequisites and for this purpose a configuration script is provided which can be launched on any Debian-based system. In other cases, the script must be edited or it is possible to resort to manual installation.

ARPIR can be launched through a graphical interface, developed through the zenity [8] software in a bash script GUI_ARPIR.sh. Once the interface has been started, in the first window you are asked to enter: project name, which will be the name of the main folder; pool name, which represents a subfolder in which the results will be written; sample names, separated by comma, without spaces and ordered to have first the controls and then the treated (both at least in duplicate); sample types, in the same order of names and separated by comma, represent the belonging of the samples to one or the other group; log name, that is the name of the file in which the analysis log will be saved; comparisons of interest. In the second window it should be specified if it is a paired-end or single-end analysis. It is then requested to select, in order: the read 1 and any read 2 belonging to the various samples, the reference genome (and related indexes), the BED file, the genome PhiX (a control genome commonly used as a control for Illumina sequencing runs), the ribosomal genome 1 (5s rDNA) and 2 (rDNA), the GTF file, the reference genome, the directory where the scripts are located, the output directory, the directory in which to save the log, the library type, the pipeline desired for alignment, quantification and Differential Expression Analysis (DEA), if you want to make a final tertiary-analysis, and if so, the number of categories you want to display in the final plots (numbers above five tend to decrease readability), the number of threads to use. Finally, there is a summary table to check all inputs and parameters entered.

For launching from the command line the same information is required. The command in this case can be launched using a template (see Additional file 1).

The Pipeline

The main analysis is the Python script ARPIR.py (Fig. 1), in which all the various scripts are recalled, with the exception of the Shiny App, which can be launched from the command line once the analysis is complete.

Fig. 1
figure 1

Workflow of ARPIR pipeline. The ARPIR pipeline, starting from the input files and parameters, performs an RNA-Seq analysis. First of all the primary-analysis, a quality control on the FastQ files occurs, followed by a pre-processing and alignment, which can be done through TopHat2, HISAT2 or STAR, finally there is a new quality control on the BAM files. The secondary-analysis is the quantification and differential analysis, which can follow the featureCounts-edgeR, featureCounts-DESeq2 or Cufflinks-cummeRbund pipelines. Then an optional tertiary-analysis follows, composed of a GO analysis and a Pathway analysis. The results obtained can then be viewed in a Shiny App and possibly downloaded to a report

After a series of checks to verify the existence of the input files and the correctness of the parameters, the first script called is the alignment (alignment.sh, alignment_se.sh for single-end). In addition, an input.csv file is created in which the sample information will be saved, which are necessary for subsequent analysis. The alignment script acts on one sample at a time, in the order in which they are inserted.

Before the actual alignment, a quality control is performed with FastQC [9] on the supplied files and the report for the various samples is saved in the “Quality” folder. Then the analysis with FastQ Screen [10] follows to identify the genome contaminant and also in this case the results are saved in the “Quality” folder.

Then the pre-processing phase for PhiX and ribosomal genome removal begins. Through bwa mem [11] the FastQ files are aligned on the contaminant genome and the reads filtered with samtools [12, 13] to keep only the best alignments. The procedure is repeated both for the PhiX genome and for the ribosomal genome (for each sample).

Then the actual alignment phase follows, which can be performed with TopHat2 [14], HISAT2 [15] and STAR [16]. On the BAM files obtained, a quality control is performed using RSeQC [17], in particular by using inner distance, junction annotation, junction saturation, bam statistics, and read distribution. The results are saved in a “RSeQC” folder inside the folder with the BAM files.

In addition to the BAM files, bigwig and bedgraph files are also produced in the same folder; they are useful to generate tracks that can be loaded—for visual inspection and comparison analysis—on genome browsers (e.g. UCSC Genome Browser, IGV and similar).

In this phase, two summary tables are also generated, saved in the “report” folder, with information regarding both the parameters used and the reads removed from the samples during the pre-processing.

The next script that is invoked is the one for quantification and DEA, quantification.py. Quantification can occur through featureCounts [18] or Cufflinks and in both cases the data contained in the input.csv file are used to trace the metadata and path of the samples. In the case of Cufflinks, quantification takes place through a series of successive steps that involve the cufflinks, cuffmerge and cuffquant functions and to these follows the differential analysis phase with cuffdiff (as described in [19]).

The R scripts for the final DEA analysis are then called. In the case of featureCounts, runedgeR.R or runDESeq2.R can be called and in the case of Cufflinks runcummeRbund.R. In all three cases, first the FPKM matrix for the samples is calculated and used to create the PCA plot, saved in the “Quantification and DEA” folder both in png and in pdf format.

Then the actual differential analysis follows. For edgeR [20], we first have a selection of genes that are represented at least one counts per million (CPM) in at least two samples; the filtering is performed independently of which sample belongs to which group so that no bias is introduced. Then follows a normalization through the Trimmed Mean of M values (TMM) method, an estimate of the dispersions (in order common, trended and tagwise dispersions) and finally the differential analysis, with a first fit to the Generalized Linear Model (GLM) followed by a Likelihood Ratio Test (LRT). TMM is the recommended for most RNA-Seq data where the majority (more than half) of the genes are believed not differentially expressed between any pair of the samples [21]; for this reason it is the default method used for normalization in edgeR. For DESeq2 [22] a predefined analysis is performed through the steps: estimation of size factors, estimation of dispersion, and Negative Binomial GLM fitting and Wald statistics. The results are then subjected to a logarithmic transformation. As already mentioned, in the Cufflinks pipeline the differential analysis has already been done, so cummeRbund is used only to import and make the results readable. Finally, in all three cases the results are written in tables in csv format, where we can find the name of the gene, the value of log2-Fold Change, the adjusted p value and the values of FPKM for the various samples. A series of plots are also created and saved in png and pdf format: a volcano plot, showing the Fold Change and the p value for all the genes, a heatmap of the 100 genes with greater variance for all the samples, a correlation heatmap between the samples, to verify the similarity between replicates and between samples subjected to different treatment.

The last invoked script is GO_pathway.R for the tertiary-analysis. An enrichment analysis is performed in the Gene Ontology database [23] on genes with an adjusted p value of less than 0.05 and an absolute value of log2-Fold Change greater than 1.5 using the clusterProfiler package [24]. The results are saved in two tables: one presents the terms of Gene Ontology enriched in order of p value and the list of genes present in each group, with the relative information of p value, q value and count; the other presents a gene enriched by row, the GO term for which it is enriched and the value of Fold Change. Furthermore, three types of graphs are generated: a treemap with all the terms of GO enriched and in which the dimensions are proportional to the number of genes; three dotplots (one for each GO domain), in which the five (default) most enriched categories are shown; three cnetplots, in which the same categories and their genes are shown in the form of networks. Then the Pathway analysis follows, performed using an enrichment test (via the clusterProfiler package) in the KEGG database [25,26,27]. Even in this case the results are saved in a csv table and in a dotplot and in a cnetplot. Furthermore, images of the most enriched pathways are generated with the Fold Changes of genes through the pathview package [28].

All results are saved in specific folders with the tables in csv and the plots in pdf and png.

To make the pipeline automated and not require a massive user intervention in the choice of parameters, we have made ARPIR a rigid pipeline, in which it is not possible to add tools or methods other than those provided.

ShinyApp and reporting

You can view them interactively via the Shiny App with the command:

figure a

The results are shown in a series of tabs divided according to the analysis step (see Additional file 1 for further details) and the whole can be downloaded in a pdf, docx or html report. In addition to the results, the report also includes the parameters used and a brief description of the various steps taken during the analysis.

The summary tab of the Shiny App contains two tables. The first table shows the parameters chosen for the analysis, while the second shows a summary for the various samples. In particular, for each sample is reported the type (control or treated), the number of raw reads, which are the reads in the initial FastQ file, the number of PhiX reads, which are the reads removed because they are part of PhiX contaminating genome, and the number of ribosomal reads, which are instead the reads removed because belonging to the ribosomal RNA.

The FastQ quality tab contains the plots obtained from the quality analysis carried out with FastQ-Screen and FastQC on the various samples. The first plot is the output of FastQ-Screen and shows the percentage of DNA of sample reads mapped on human, murine, PhiX and ribosomal genomes. The second plot is instead the output of FastQC and shows the quality of the reads of the sample calculated using the Phred Score contained in the FastQ files. The quality is showed through boxplots based on the position on the read. The values found in the green zone are considered of good quality. The drop-down menu in sidebar allows to browse through the various samples.

The quality of the BAM files was evaluated using RSeQC software. In particular, in the first table statistics related to the reads mapping are reported, in the second table reads fractions mapped on the coding exon part, on the 5′-UTR region, on the 3′-UTR region and on the intronic or intragenic regions are reported. The first image shows the distribution of the internal distance between paired reads; the second image shows number of splicing junctions per percent of total reads and splits junctions in all, known and novel; the third and the fourth images show percentage of novel splicing junctions and events.

For the differential expression analysis, a summary table with the results and FPKM values for each sample is given. Then a series of summary plots follows and in particular: a PCA to evaluate the differences between the samples; a volcano plot, which reports the values of Fold Change and p-value for all genes; a heatmap of the 100 genes with greater variance, in which the value of the Z-score is reported and therefore the distance from the mean for the various samples; a heatmap showing the distances between the samples, calculated in a distance matrix using the Euclidean distance method.

The outputs of the Gene Ontology analysis are two summary tables, the first one showing a GO term for each line while the second one a gene for each line. Then three interactive networks follow, one for each category of GO, which allow to view the enriched genes and their Fold Change. Finally, there are a treemap, where the size of each rectangle is proportional to the number of genes, and three dotplots, which report the five terms of GO that were more enriched for the genes.

For the Pathway analysis a summary table with enriched pathways is shown, followed by an interactive network similar to that of GO. Finally, there is a dotplot with the five most enriched pathways.

Materials and methods

The in vitro data we analyzed came from RNA-Seq experiments performed on two different AML cell lines with NPM1 mutation: OCI-AML3 [29] and IMS-M2 [30]. In both cases the treatment conditions were compared with the conditions without treatment. For each condition the experiment was done in triplicate. The kit used for the preparation of the sample was the TruSeq RNA [31], while the sequencer used for the sequencing was HiSeq 2500 by Illumina [32] in rapid run and with a flow cell. Sequencing occurred in paired-end and using two lanes for sample. The two lanes corresponding to the same sample have been merged into a single file before the alignment phase.

To validate the results of DEA, we used also in silico RNA-Seq data. We have resorted to the R package polyester [8] to generate a set of samples in paired-end belonging to two different groups and with three replicates per group (see Additional file 2). The FastQ files thus obtained were analyzed through the eight pipelines until the differential expression data were obtained (see Additional files 3, 4, 5, 6, 7, 8, 9, 10). To obtain sensitivity and specificity values that were representative of the entire pipeline, we also considered missing alignments and false alignments [33] (see Additional file 1 for more details).

Both the data obtained from the cell line samples and the simulated RNA-Seq data were submitted to the same analysis.

An initial quality analysis was performed on FastQ files using FastQC [9] software and a contaminant genome evaluation using FastQ-Screen [10]. We then removed the PhiX genome and the ribosomal genome by identifying sequences through alignment on samples with bwa.

The eight pipelines we have considered are those that use some of the most popular tools (Fig. 2). In particular, for alignment we considered three different aligners: TopHat2 [14], HISAT2 [15], STAR [16] and kallisto [34]. The first three are based on a classical alignment performed on the genome through different algorithms, while the fourth one uses a pseudo-alignment on the transcriptome. As for quantification, we have selected four software: Cufflinks [35], StringTie [36], featureCounts [18] and kallisto [34] itself, which in addition to the alignment also performs quantification. The first two resort to a statistical approach to quantification, while the third resort to a simple read count. Finally, for the differential analysis we used five different R packages: cummeRbund [37], Ballgown [38], DESeq2 [22], edgeR [20] and sleuth [39]. Each uses a different statistical approach, except cummeRbund which merely shows the results, while the true analysis is performed by Cufflinks.

Fig. 2
figure 2

Plot of the eight tested pipelines. We tested different pipelines with the most used tools. We considered four aligners, four software for quantification and five R packages for differential analysis

Results and discussion

For the eight pipelines, we analyzed the time taken for the various processes and the peaks of memory used. The data in Table 1 show the values for the analysis in the pipelines using 4 threads per process. The data shown are related to the analysis on the 6 samples of the IMS-M2 cell line (see Additional file 1 for more details).

Table 1 Table of the times and the RAM memory peaks reached during the RNA-Seq analysis for the eight pipelines

By first evaluating the pipelines in terms of time consumption it was found that the slowest pipeline is that of TopHat2 and the fastest is that of kallisto. From the point of view of memory consumption, all pipelines are quite similar. Interestingly, however, note that all the peaks are kept below 5 Gb, so the analysis in all cases can be carried forward even on a normal PC. However, this is not the case with STAR, which requires 30 Gb of RAM to work at its best.

As well as comparing time and RAM memory performance, we then compared the pipelines through the results obtained. We calculated sensitivity and specificity using the simulated RNA-Seq data (see Additional file 1 for more details) (Fig. 3).

Fig. 3
figure 3

Sensitivity and specificity values for the eight pipelines. Sleuth and Ballgown give the worst results, while DESeq2 and edgeR are distinguished by the low number of false negatives (sensitivity) and cummeRbund due to the low number of false positives (specificity)

In the light of these results, it emerges that, although kallisto is the method that requires shorter times, the bad results in terms of sensitivity and specificity make it unreliable compared to other pipelines. Pipelines that use STAR or HISAT2 for alignment and featureCounts for quantification are the ones that give better results, as they present the best values of sensitivity and specificity. In terms of time they require highly less time than the TopHat2 pipeline. However, STAR pipelines are the ones that require more consumption of RAM and for this reason we preferred to choose one of the other pipelines as the default option, as they can also be launched on a normal PC. edgeR and DESeq2 have very similar results both for time and for RAM peak and for sensitivity and specificity, but edgeR proves slightly higher in the differential analysis.

We have thus chosen to include in the tool TopHat2-Cufflinks-cummeRbund, STAR-Cufflinks-cummeRbund, STAR-featureCounts-DESeq2, STAR-featureCounts-edgeR, HISAT2-featureCounts-DESeq2 and HISAT2-featureCounts-edgeR pipelines and to make the latter the default option (Fig. 4).

Fig. 4
figure 4

Workflow of HISAT2-featureCounts-edgeR pipeline. The alignment, launched in bash, requires FastQ files and the reference genome and leads to BAM files, which in turn become inputs for featureCounts that generates a count matrix. Finally, the differential analysis with edgeR generates a differential expression matrix

Automatic RNA-Seq Pipelines with Interactive Report (ARPIR) carry out the entire RNA-Seq analysis and can be used by command line, but a graphical interface is also available that, through a series of successive windows, allows to choose between different methods of alignment, quantification and differential expression analysis. In addition to the standard analysis, ARPIR also performs a series of quality controls and a pre-processing, as well as an optional final tertiary-analysis. In particular, it makes quality control on FastQ files, pre-processing, alignment, quality control on BAM files, transcript quantification and differential expression analysis. Given the input files and the working directory, ARPIR is completely automated. First, quality control on FastQ files is performed with FastQC e FastQ-Screen. FastQC makes quality control and creates one report for sample. FastQ-Screen estimates approximately the percentage of reads that can be mapped on genomes other than human, like ribosomal genome, PhiX genome and mouse genome. This allows to evaluate the presence of contaminating genomes. Pre-processing follows quality control: the reads are aligned on PhiX genome and ribosomal genome to eliminate contaminations. Alignment can be performed with TopHat2, HISAT2 or STAR; in the first case quantification is performed with Cufflinks and DEA with cummeRbund, in the second case quantification is performed with featureCounts and DEA with DESeq2 or edgeR, in the third case it is possible to choose one of the previous methods for quantification and DEA. A second intermediate quality control is also performed on the aligned BAM files with some of the RSeQC scripts and in particular: inner distance, junction annotation, junction saturation, bam stat, read distribution.

It is possible to perform an optional tertiary-analysis on the results. It consists in Gene Ontology enrichment and KEGG Pathway enrichment analyses on the differentially expressed genes (with absolute log2-Fold Change value higher than 1.5 and adjusted p value lower than 0.05). The tertiary-analysis part has been developed exclusively for the human genome at the moment, although the rest of the pipeline can also work for different genomes. One of the future goals is to expand the genomes available also for tertiary-analysis.

Finally, the results obtained and saved in the appropriate folders can be viewed in an interactive Shiny App [40], from which you can also download a report with all the results. The advantage of showing the results in this form is that, once the Shiny App is launched, it is intuitive and easy to use even for those who are not familiar with computer science. The Shiny App shows the results of the RNA-Seq analysis divided into a series of tabs for each phase: the summary tab contains two tables that show the initial setting parameters and details about pre-processing on the FastQ files; the FastQ quality tab contains the FastQC and FastQ-Screen outputs; the BAM quality tab contains the RSeQC outputs obtained from the quality analysis on the aligned files; the differential expression analysis tab contains a result table and a series of plots and in particular a PCA, a volcano plot, a heatmap of the 100 genes with greater variance, a heatmap showing the distances between the samples; the Tertiary-analysis tab is divided into two sub-tab, one for GO analysis and the other for Pathway analysis, both containing a result table and a series of dotplots and interactive network plots (see Additional file 1 for more details).

Although ARPIR is not ready for High Performance Computing architecture (HPC), there are no limits for sample management. The memory management that ARPIR uses is very optimized and multiple sequencing runs can be managed in parallel, regardless of the number of samples.

RNA-Seq runs with 10–15 samples can be analyzed even in a normal personal computer with 8–16 GB of RAM. For larger runs, the use of a workstation with 64 GB of RAM is recommended.

Other aspects of interest in the use of a tool for RNA-Seq analysis are reproducibility and transparency, important elements especially in view of scientific publications, where often it is required to describe in detail each step of the analysis [41,42,43]. However, this is not considered in most tools. The analysis performed with ARPIR is entirely reproducible since the code is open access and the individual steps are executed by different scripts, so that it is also possible to reproduce them separately. Furthermore, to offer the greatest possible clarity regarding the steps addressed, the log produced during the analysis is saved in the selected folder. Finally, in the markdown report that can be downloaded from the Shiny App, all the parameters used (both optional and default) are reported with an explanation of their function, as well as a brief description of each phase with related outputs.

Conclusions

In order to identify a reference pipeline for RNA-Seq analysis, we evaluated some of the most used tools by combining them in eight different pipelines. We considered consumption in terms of time and RAM memory and we also evaluated the sensitivity and specificity of the different pipelines through RNA-Seq simulation data. What has emerged is that the HISAT2-featureCounts-edgeR pipeline is the best in terms of time, RAM consumption and sensitivity, but good performances are also in the HISAT2-featureCounts-DESeq2, in the TopHat2 and in the STAR pipelines.

We have therefore developed a single tool, ARPIR, which contains the six pipelines and automatically performs the entire standard analysis, from quality control on FastQ files to Pathway and Gene Ontology analyses. In addition, multiple comparisons can be made between different groups within a single run. In order to make ARPIR easier to use even for those not familiar with computer science, we have also provided a configuration script for installing prerequisites and the possibility of being launched through a graphical interface (Additional file 11), as well as from the command line using a template script. Furthermore, to make the results more easily accessible, we have developed an interactive Shiny App, from which it is also possible to download a summary report. These features make ARPIR a complete tool, easy to use and a reference point for our institute in the field of RNA-Seq analysis [44].